Abstract: We wish to estimate a conditional density using a Gaussian mixture regression model
with logistic weights and means depending on the covariate. We aim at selecting the number of
components of this model as well as the other parameters by a penalized maximum likelihood
approach. We provide a lower bound on the penalty, proportional up to a logarithmic term to the
dimension of each model, that ensures an oracle inequality for our estimator. Our theoretical
analysis is supported by some numerical experiments.
Gaussian Mixture Regression model with logistic weights, a
penalized maximum likelihood approach 

Keywords: Conditional density estimation, Gaussian mixture regression, Model selection
1 Framework

In classical Gaussian mixture models, the density is modeled by

\[ s_{K,\upsilon,\Sigma,w}(y) = \sum_{k=1}^{K} \pi_{w,k} \Phi_{\upsilon_k,\Sigma_k}(y), \]

where $K \in \mathbb{N}^*$ is the number of mixture components, $\Phi_{\upsilon,\Sigma}$ is the density of a
Gaussian of mean $\upsilon$ and covariance matrix $\Sigma$, and the mixture weights can always be defined
from a $K$-tuple $(w_1, \ldots, w_K)$ with a logistic scheme:

\[ \pi_{w,k} = \frac{e^{w_k}}{\sum_{k'=1}^{K} e^{w_{k'}}}. \]



In this article, we consider such a model in which mixture weights as well as means can depend 
on a covariate. 

More precisely, we observe $n$ pairs of random variables $((X_i, Y_i))_{1 \le i \le n}$ where the covariates
$X_i$ are independent and the $Y_i$ are independent conditionally on the $X_i$. We want to estimate the
conditional density $s_0(\cdot|x)$ with respect to the Lebesgue measure of $Y$ given $X$. We model this
conditional density by a mixture of Gaussian regressions with varying logistic weights

\[ s_{K,\upsilon,\Sigma,w}(y|x) = \sum_{k=1}^{K} \pi_{w(x),k} \Phi_{\upsilon_k(x),\Sigma_k}(y), \]

where $(\upsilon_1, \ldots, \upsilon_K)$ and $(w_1, \ldots, w_K)$ are now $K$-tuples of functions chosen, respectively, in a set
$\Upsilon_K$ and $W_K$. Our aim is then to estimate those functions $\upsilon_k$ and $w_k$, the covariance matrices $\Sigma_k$
as well as the number of classes $K$ so that the error between the estimated conditional density
and the true conditional density is as small as possible.
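
To make the model concrete, here is a minimal Python sketch of the conditional density above for a scalar response (p = 1). The function names (`logistic_weights`, `conditional_density`) and the representation of the mean and weight functions as plain Python callables are assumptions made for this illustration, not part of the paper.

```python
import numpy as np

def logistic_weights(w_values):
    """Logistic scheme: pi_k = exp(w_k) / sum_k' exp(w_k') over the last axis."""
    w_values = np.asarray(w_values, dtype=float)
    shifted = w_values - w_values.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def gaussian_pdf(y, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def conditional_density(y, x, w_funcs, mean_funcs, variances):
    """s(y|x) = sum_k pi_{w(x),k} * Phi_{v_k(x), var_k}(y) for scalar x and y."""
    pi = logistic_weights([f(x) for f in w_funcs])
    means = np.array([f(x) for f in mean_funcs], dtype=float)
    return float(np.sum(pi * gaussian_pdf(y, means, np.asarray(variances, dtype=float))))
```

For instance, `conditional_density(1.0, 0.3, [lambda x: 0.0, lambda x: 15 * x - 7], [lambda x: -15 * x + 8, lambda x: 0.4 * x + 0.6], [0.3, 0.4])` evaluates a two-component model with affine weights and means at x = 0.3, y = 1.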
The classical Gaussian mixture case has been much studied [18]. Nevertheless, the theoretical
properties of such models have been less considered. In a Bayesian framework, asymptotic
properties of the posterior distribution are obtained by Choi [7], Genovese and Wasserman [12], Van der
Vaart and Wellner [19] when the true density is assumed to be a Gaussian mixture. AIC/BIC
penalization schemes are often used to select a number of clusters (see Burnham and Anderson
for instance). Non asymptotic bounds are obtained by Maugis and Michel [16] even when
the true density is not a Gaussian mixture. All these works rely heavily on a bracketing entropy
analysis of the models, that will also be central in our analysis.

When there is a covariate, the most classical extension of this model is the Gaussian mixture
regression, in which the means $\upsilon_k$ are now functions. It is well studied, as described in McLachlan
and Peel [18]. Models in which the proportions vary have been considered by Antoniadis et al.
Using an idea of Kolaczyk et al. [14], they have considered a model in which only the proportions
depend, in a piecewise constant manner, on the covariate. Their theoretical results are
nevertheless obtained under the strong assumption that they exactly know the Gaussian components.
This assumption can be removed as shown by Cohen and Le Pennec [8]. Models in which both
mixture weights and means depend on the covariate are considered by Ge and Jiang [11], but
in a logistic regression mixture framework. They give conditions on the number of experts to
obtain consistency of the posterior with logistic weights. Note that similar properties are studied
by Lee [15] for neural networks.

Although natural, Gaussian mixture regression with varying logistic weights seems to be
mentioned first by Jordan and Jacobs [13]. They provide an algorithm similar to ours, based
on EM and IRLS, for hierarchical mixtures of experts but no theoretical analysis. Chamroukhi
et al. [6] consider the case of a piecewise polynomial regression model with affine logistic weights.
In our setting, this corresponds to a specific choice for $\Upsilon_K$ and $W_K$: a collection of piecewise
polynomials and a set of affine functions. They use a variation of the EM algorithm and a BIC
criterion and provide numerical experiments to support the efficiency of their scheme. In this
paper, we propose a slightly different penalty choice and prove non asymptotic bounds for the
risk under very mild assumptions on $\Upsilon_K$ and $W_K$ that hold in their case.

2 A model selection approach 

We will use a model selection approach and define some conditional density models $S_m$ by
specifying sets of Gaussian regression mixture conditional densities through their number of
classes $K$, a structure on the covariance matrices $\Sigma_k$ and two function sets $\Upsilon_K$ and $W_K$ to
which belong respectively the $K$-tuple of means $(\upsilon_1, \ldots, \upsilon_K)$ and the $K$-tuple of logistic weights
$(w_1, \ldots, w_K)$. Typically those sets are compact subsets of polynomials of low degree. Within
such a conditional density set $S_m$, we estimate $s_0$ by the maximizer $\widehat{s}_m$ of the likelihood

\[ \widehat{s}_m = \operatorname*{argmax}_{s_{K,\upsilon,\Sigma,w} \in S_m} \sum_{i=1}^{n} \ln s_{K,\upsilon,\Sigma,w}(Y_i|X_i), \]

or more precisely, to avoid any existence issue, by any $\eta$-minimizer of the -log-likelihood:

\[ \sum_{i=1}^{n} -\ln \widehat{s}_m(Y_i|X_i) \le \min_{s_{K,\upsilon,\Sigma,w} \in S_m} \sum_{i=1}^{n} -\ln s_{K,\upsilon,\Sigma,w}(Y_i|X_i) + \eta. \]

Assume now we have a collection $\{S_m\}_{m \in \mathcal{M}}$ of models, for instance with different numbers of
classes $K$ or different maximum degrees for the polynomials defining $\Upsilon_K$ and $W_K$. We should
choose the best model within this collection. Using only the log-likelihood is not sufficient since
this favors models with large complexity. To balance this issue, we will define a penalty pen(m)
and select the model $\widehat{m}$ that minimizes (or rather $\eta'$-almost minimizes) the sum of the opposite
of the log-likelihood and this penalty:

\[ \sum_{i=1}^{n} -\ln \widehat{s}_{\widehat{m}}(Y_i|X_i) + \mathrm{pen}(\widehat{m}) \le \min_{m \in \mathcal{M}} \left( \sum_{i=1}^{n} -\ln \widehat{s}_m(Y_i|X_i) + \mathrm{pen}(m) \right) + \eta'. \]
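
The selection rule above amounts to an argmin over the collection once each model has been fitted. A minimal sketch, assuming the negative log-likelihoods of the fitted models and their penalties are already available as arrays (the function name `select_model` is hypothetical):

```python
import numpy as np

def select_model(neg_log_lik, penalties):
    """Return the index minimizing neg_log_lik[m] + pen(m), and the criterion values."""
    crit = np.asarray(neg_log_lik, dtype=float) + np.asarray(penalties, dtype=float)
    return int(np.argmin(crit)), crit
```

With illustrative (made-up) values, `select_model([1500.0, 1210.0, 1208.0, 1207.0], [2.0, 5.0, 8.0, 11.0])` picks the second model: the likelihood gain of the larger models does not compensate for their penalty.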

Our goal is now to define a penalty pen(m) which ensures that the maximum likelihood
estimate in the selected model performs almost as well as the maximum likelihood estimate in
the best model. More precisely, we will prove that

\[ E\left[ JKL_\rho^{\otimes n}(s_0, \widehat{s}_{\widehat{m}}) \right] \le C_1 \inf_{m \in \mathcal{M}} \left( \inf_{s_m \in S_m} KL^{\otimes n}(s_0, s_m) + \frac{\mathrm{pen}(m)}{n} \right) + C_1 \frac{\kappa_0 \Sigma + \eta + \eta'}{n} \]

where $KL^{\otimes n}$ is a tensorized Kullback-Leibler divergence, $JKL_\rho^{\otimes n}$ a lower bound of this divergence,
$\kappa_0$ and $\Sigma$ are constants specified in the next sections,
with a pen(m) chosen of the same order as the variance of the corresponding single model
maximum likelihood estimate. In the next section, we specify all those divergences and explain
the general framework proposed by Cohen and Le Pennec [9] for conditional density estimation. We
will then explain how to use those results in our specific setting. The last section is dedicated
to some numerical experiments conducted, for the sake of simplicity, in the case where $X \in [0, 1]$ and
$Y \in \mathbb{R}$.

3 A general conditional density model selection theorem 

We summarize in this section the main result of Cohen and Le Pennec [9] that will be our main
tool to obtain the previous oracle inequality. In this work, the estimator loss is measured with a
divergence $JKL_\rho^{\otimes n}$ defined as a tensorized Kullback-Leibler divergence between the true density
and a convex combination of the true density and the estimated one. Contrary to the true
Kullback-Leibler divergence, to which it is closely related, it is bounded. This boundedness
turns out to be crucial to control the loss of the penalized maximum likelihood estimate under
mild assumptions on the complexity of the models and their collection.

Let KL be the classical Kullback-Leibler divergence, which measures a distance between two
density functions. Since we work in a conditional density framework, we use a tensorized version
of it. We define the tensorized Kullback-Leibler divergence $KL^{\otimes n}$ by

\[ KL^{\otimes n}(s,t) = E\left[ \frac{1}{n} \sum_{i=1}^{n} KL\big(s(\cdot|X_i), t(\cdot|X_i)\big) \right], \]

which appears naturally in this setting. Replacing $t$ by a convex combination between $s$ and $t$
yields the so-called Jensen-Kullback-Leibler tensorized divergence, denoted $JKL_\rho^{\otimes n}$,

\[ JKL_\rho^{\otimes n}(s,t) = E\left[ \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\rho} KL\big(s(\cdot|X_i), (1-\rho)s(\cdot|X_i) + \rho t(\cdot|X_i)\big) \right] \]

with $\rho \in (0, 1)$. This loss is always bounded by $\frac{1}{\rho}\ln\frac{1}{1-\rho}$ but behaves as KL when $t$ is close to
$s$. Furthermore $JKL_\rho^{\otimes n}(s,t) \le KL^{\otimes n}(s,t)$. If we let $d^{2\otimes n}$ be the tensorized extension of the
squared Hellinger distance $d^2$, Cohen and Le Pennec [9] prove that there is a constant $C_\rho$ such that
$C_\rho d^{2\otimes n}(s,t) \le JKL_\rho^{\otimes n}(s,t)$.
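
As an illustration of these two divergences, the sketch below approximates KL and JKL between two fixed univariate densities by a Riemann sum, which is enough to observe numerically that JKL is smaller than KL and stays below $\frac{1}{\rho}\ln\frac{1}{1-\rho}$. The helper names and the choice of a uniform grid are assumptions made for this example; the densities are assumed strictly positive on the grid.

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def kl_numeric(s, t, grid):
    """KL(s, t) = int s ln(s / t), approximated by a Riemann sum on a uniform grid."""
    dy = grid[1] - grid[0]
    s_vals, t_vals = s(grid), t(grid)
    return float(np.sum(s_vals * np.log(s_vals / t_vals)) * dy)

def jkl_numeric(s, t, grid, rho):
    """JKL_rho(s, t) = (1 / rho) * KL(s, (1 - rho) s + rho t)."""
    def mixture(y):
        return (1.0 - rho) * s(y) + rho * t(y)
    return kl_numeric(s, mixture, grid) / rho

# Two unit-variance Gaussians with means 0 and 3: KL is (3 ** 2) / 2 = 4.5.
grid = np.linspace(-12.0, 15.0, 27001)
s = lambda y: gaussian_pdf(y, 0.0, 1.0)
t = lambda y: gaussian_pdf(y, 3.0, 1.0)
kl = kl_numeric(s, t, grid)
jkl = jkl_numeric(s, t, grid, rho=0.5)
```

Here `kl` is close to 4.5 while `jkl` remains below the bound $2 \ln 2$ obtained for $\rho = 1/2$.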
To any model $S_m$, a set of conditional densities, we associate a complexity defined in terms
of a specific entropy, the bracketing entropy with respect to the root of $d^{2\otimes n}$. Recall that a
bracket $[t^-, t^+]$ is a pair of real functions such that $\forall (x,y) \in \mathcal{X} \times \mathcal{Y}$, $t^-(x,y) \le t^+(x,y)$ and a
function $s$ is said to belong to the bracket $[t^-, t^+]$ if $\forall (x,y) \in \mathcal{X} \times \mathcal{Y}$, $t^-(x,y) \le s(x,y) \le t^+(x,y)$.
The bracketing entropy $H_{[\cdot],d}(\delta, S)$ of a set $S$ is defined as the logarithm of the minimal number
$N_{[\cdot],d}(\delta, S)$ of brackets $[t^-, t^+]$ covering $S$, such that $d(t^-, t^+) \le \delta$. Our main assumption on
models is an upper bound of a Dudley type integral of these bracketing entropies:

Assumption (H) For every model $S_m$ in the collection $\mathcal{S}$, there is a non-decreasing function
$\phi_m$ such that $\delta \mapsto \frac{1}{\delta}\phi_m(\delta)$ is non-increasing on $(0, +\infty)$ and for every $\sigma \in \mathbb{R}^+$,

\[ \int_0^{\sigma} \sqrt{H_{[\cdot],d^{\otimes n}}(\delta, S_m)} \, d\delta \le \phi_m(\sigma). \]



One needs further to control the complexity of the collection as a whole through a coding type
(Kraft) assumption.

Assumption (K) There is a family $(x_m)_{m \in \mathcal{M}}$ of non-negative numbers such that

\[ \sum_{m \in \mathcal{M}} e^{-x_m} \le \Sigma < +\infty. \]

For technical reasons, a separability assumption, always satisfied in the setting of this paper, is
also required.

Assumption (Sep) For every model $S_m$ in the collection $\mathcal{S}$, there exists some countable subset
$S'_m$ of $S_m$ and a set $\mathcal{Y}'_m$ with $\lambda(\mathcal{Y} \setminus \mathcal{Y}'_m) = 0$ such that for every $t$ in $S_m$, there exists some
sequence $(t_k)_{k \ge 1}$ of elements of $S'_m$ such that for every $x$ and every $y \in \mathcal{Y}'_m$,
$\ln(t_k(y|x)) \to \ln(t(y|x))$ as $k \to +\infty$.

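The main result of Cohen and Le Pennec [9] is a condition on the penalty pen(m) which ensures
an oracle type inequality:

Theorem 1. Assume we observe $(X_i, Y_i)$ with unknown conditional density $s_0$. Let $\mathcal{S} = (S_m)_{m \in \mathcal{M}}$ be
an at most countable conditional density model collection. Assume assumptions (H), (Sep) and
(K) hold. Let $\widehat{s}_m$ be an $\eta$ -log-likelihood minimizer in $S_m$:

\[ \sum_{i=1}^{n} -\ln(\widehat{s}_m(Y_i|X_i)) \le \inf_{s_m \in S_m} \left( \sum_{i=1}^{n} -\ln(s_m(Y_i|X_i)) \right) + \eta. \]

Then for any $\rho \in (0, 1)$ and any $C_1 > 1$, there is a constant $\kappa_0$ depending only on $\rho$ and $C_1$
such that, as soon as for every index $m \in \mathcal{M}$,

\[ \mathrm{pen}(m) \ge \kappa (n\sigma_m^2 + x_m) \]

with $\kappa > \kappa_0$ and $\sigma_m$ the unique root of $\frac{1}{\sigma}\phi_m(\sigma) = \sqrt{n}\sigma$, the penalized likelihood estimate $\widehat{s}_{\widehat{m}}$ with
$\widehat{m}$ such that

\[ \sum_{i=1}^{n} -\ln(\widehat{s}_{\widehat{m}}(Y_i|X_i)) + \mathrm{pen}(\widehat{m}) \le \inf_{m \in \mathcal{M}} \left( \sum_{i=1}^{n} -\ln(\widehat{s}_m(Y_i|X_i)) + \mathrm{pen}(m) \right) + \eta' \]

satisfies

\[ E\left[ JKL_\rho^{\otimes n}(s_0, \widehat{s}_{\widehat{m}}) \right] \le C_1 \inf_{m \in \mathcal{M}} \left( \inf_{s_m \in S_m} KL^{\otimes n}(s_0, s_m) + \frac{\mathrm{pen}(m)}{n} \right) + C_1 \frac{\kappa_0 \Sigma + \eta + \eta'}{n}. \]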

The name oracle type inequality means that the right-hand side is a proxy for the estimation
risk of the best model within the collection. The term $\inf_{s_m \in S_m} KL^{\otimes n}(s_0, s_m)$ is a typical bias
term while $\frac{\mathrm{pen}(m)}{n}$ plays the role of the variance term. We have three sources of loss here: the
constant $C_1$ cannot be taken equal to 1, we use a different divergence on the left and on the right,
and $\frac{\mathrm{pen}(m)}{n}$ is not directly related to the variance. The first issue is often considered as minor
while the second one turns out to be classical in density estimation results. Whenever pen(m)
can be chosen approximately proportional to the dimension $D_m$ of the model, which will be
the case in our setting, $\frac{\mathrm{pen}(m)}{n}$ is approximately proportional to $D_m/n$, which is the asymptotic
variance in the parametric case. The right-hand side matches nevertheless the best known bound
obtained for a single model within such a general framework.

In the next section, we show how to apply this result in our Gaussian mixture setting and 
prove that the penalty can be chosen roughly proportional to the intrinsic dimension of the 
model, and thus of the order of the variance. 

4 Spatial Gaussian regression mixture estimation theorem 

As explained in the introduction, we are looking for conditional densities of type

\[ s_{K,\upsilon,\Sigma,w}(y|x) = \sum_{k=1}^{K} \pi_{w,k}(x) \Phi_{\upsilon_k(x),\Sigma_k}(y), \]

where $K \in \mathbb{N}^*$ is the number of mixture components, $\Phi_{\upsilon,\Sigma}$ is the density of a Gaussian of mean
$\upsilon$ and covariance matrix $\Sigma$, $\upsilon_k$ is a function specifying the mean given $x$ of the $k$-th component
while $\Sigma_k$ is its covariance matrix and the mixture weights $\pi_{w,k}$ are defined from a collection of
$K$ functions $w_1, \ldots, w_K$ by a logistic scheme:

\[ \pi_{w,k}(x) = \frac{e^{w_k(x)}}{\sum_{k'=1}^{K} e^{w_{k'}(x)}}. \]

For the sake of simplicity, we will assume that the covariate $X$ belongs to a hypercube so that
$\mathcal{X} = [0;1]^d$.

We will estimate those conditional densities by conditional densities belonging to some model
$S_m$ defined by

\[ S_m = \left\{ (x,y) \mapsto \sum_{k=1}^{K} \pi_{w,k}(x) \Phi_{\upsilon_k(x),\Sigma_k}(y) \;\middle|\; (w_1, \ldots, w_K) \in W_K,\ (\upsilon_1, \ldots, \upsilon_K) \in \Upsilon_K,\ (\Sigma_1, \ldots, \Sigma_K) \in V_K \right\} \]

where $W_K$ is a compact set of $K$-tuples of functions from $\mathcal{X}$ to $\mathbb{R}$, $\Upsilon_K$ a compact set of $K$-tuples
of functions from $\mathcal{X}$ to $\mathbb{R}^p$ and $V_K$ a compact set of $K$-tuples of covariance matrices of size $p \times p$.
Before describing more precisely those sets, we recall that $S_m$ will be taken in a model collection
$\mathcal{S} = (S_m)_{m \in \mathcal{M}}$, where $m$ specifies a choice for each of those parameters. The number of components
$K$ can be chosen arbitrarily in $\mathbb{N}^*$, but will in practice and in our theoretical example be chosen
smaller than an arbitrary $K_{\max}$, which may depend on the sample size $n$. The sets $W_K$ and $\Upsilon_K$
will be typically chosen as a tensor product of a same compact set of moderate dimension, for
instance a set of polynomials of degree smaller than respectively $d_W$ and $d_\Upsilon$ whose coefficients are
smaller in absolute value than respectively $T_W$ and $T_\Upsilon$. The structure of the set $V_K$ depends
on the noise model chosen: we can assume, for instance, that it is common to all regressions, that
they share a similar volume or diagonalization matrix or that they are all different. More precisely,
we decompose any covariance matrix $\Sigma$ into $L D A D'$, where $L = |\Sigma|^{1/p}$ is a positive scalar
corresponding to the volume, $D$ is the matrix of eigenvectors of $\Sigma$ and $A$ the diagonal matrix of
normalized eigenvalues of $\Sigma$. Let $L_-, L_+$ be positive values and $\lambda_-, \lambda_+$ real values. We define
the set $\mathcal{A}(\lambda_-, \lambda_+)$ of diagonal matrices $A$ such that $|A| = 1$ and $\forall i \in \{1, \ldots, p\}$, $\lambda_- \le A_{i,i} \le \lambda_+$.
A set $V_K$ is defined by

\[ V_K = \left\{ (L_1 D_1 A_1 D_1', \ldots, L_K D_K A_K D_K') \;\middle|\; \forall k,\ L_- \le L_k \le L_+,\ D_k \in SO(p),\ A_k \in \mathcal{A}(\lambda_-, \lambda_+) \right\}. \]

Those sets $V_K$ correspond to the classical covariance matrix sets described by Celeux and Govaert
[5].
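
The decomposition $\Sigma = L D A D'$ can be computed from an eigendecomposition. The following sketch is one way to do it with NumPy; the function name `volume_shape_orientation` is an assumption of this example, and the sign flip is only there to land in $SO(p)$ as in the definition of $V_K$.

```python
import numpy as np

def volume_shape_orientation(sigma):
    """Decompose a covariance matrix as Sigma = L * D @ A @ D.T.

    L = |Sigma|^(1/p) is the volume, A is diagonal with det(A) = 1
    (normalized eigenvalues) and D is a rotation matrix of eigenvectors.
    """
    sigma = np.asarray(sigma, dtype=float)
    p = sigma.shape[0]
    eigvals, eigvecs = np.linalg.eigh(sigma)   # symmetric eigendecomposition
    L = float(np.prod(eigvals) ** (1.0 / p))   # determinant to the power 1/p
    A = np.diag(eigvals / L)                   # normalized eigenvalues
    D = eigvecs.copy()
    if np.linalg.det(D) < 0:                   # flip one axis so that det(D) = +1
        D[:, 0] = -D[:, 0]
    return L, D, A

L, D, A = volume_shape_orientation([[2.0, 0.5], [0.5, 1.0]])
```

Constraining the volumes $L_k$, the shapes $A_k$ or the orientations $D_k$ to be shared across components gives the different covariance structures mentioned above.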

We will bound the complexity term $n\sigma_m^2$ in terms of the dimension of $S_m$: we prove that
those two terms are roughly proportional. The set $V_K$ is a parametric set and thus $\dim(V_K)$ is
easily defined as the dimension of its parameter set. Defining the dimension of $W_K$ and $\Upsilon_K$ is
more interesting. We rely on an entropy type definition of the dimension. For any $K$-tuples of
functions $(s_1, \ldots, s_K)$ and $(t_1, \ldots, t_K)$, we let

\[ d_{\|\sup\|_\infty}\big((s_1, \ldots, s_K), (t_1, \ldots, t_K)\big) = \sup_{x \in \mathcal{X}} \sup_{1 \le k \le K} |s_k(x) - t_k(x)| \]

and define the dimension $\dim(F_K)$ of a set $F_K$ of such $K$-tuples as the smallest $D$ such that
there is a $C$ satisfying

\[ H_{d_{\|\sup\|_\infty}}(\sigma, F_K) \le D \left( C + \ln\frac{1}{\sigma} \right). \]

Using the following proposition of Cohen and Le Pennec [9], we can easily verify that Assumption
(H) is satisfied.

Proposition 1. If for any $\delta \in (0; \sqrt{2}]$, $H_{[\cdot],d^{\otimes n}}(\delta, S_m) \le D_m \left( C_m + \ln\frac{1}{\delta} \right)$, then the function
$\phi_m(\sigma) = \sigma\sqrt{D_m} \left( \sqrt{C_m} + \sqrt{\pi} + \sqrt{\ln\frac{1}{\min(\sigma, 1)}} \right)$ satisfies assumption (H). Furthermore, the unique
root $\sigma_m$ of $\frac{1}{\sigma}\phi_m(\sigma) = \sqrt{n}\sigma$ satisfies

\[ n\sigma_m^2 \le D_m \left( 2\left(\sqrt{C_m} + \sqrt{\pi}\right)^2 + \left( \ln\frac{n}{\left(\sqrt{C_m} + \sqrt{\pi}\right)^2 D_m} \right)_+ \right). \]
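We show in Appendix that if

\[ H_{d_{\|\sup\|_\infty}}(\sigma, W_K) \le \dim(W_K) \left( C_{W_K} + \ln\frac{1}{\sigma} \right) \]

and

\[ H_{\max_k \sup_x \|\cdot\|_2}(\sigma, \Upsilon_K) \le \dim(\Upsilon_K) \left( C_{\Upsilon_K} + \ln\frac{1}{\sigma} \right) \]

then

\[ n\sigma_m^2 \le D_m \left( 2\left(\sqrt{C} + \sqrt{\pi}\right)^2 + \left( \ln\frac{n}{\left(\sqrt{C} + \sqrt{\pi}\right)^2 D_m} \right)_+ \right) \le D_m \left( 2\left(\sqrt{C} + \sqrt{\pi}\right)^2 + \ln(n) \right) \le D_m \left( C'_m + \ln(n) \right) \]

with $C'_m$ that depends only on the constants defining $V_K$ and the constants $C_{W_K}$ and $C_{\Upsilon_K}$. In
order to obtain the same constant $C'_m$ for all models, we impose that the dimension bound holds
with the same constants for all models:

Assumption (DIM) There exist two constants $C_W$ and $C_\Upsilon$ such that, for every model $S_m$ in
the collection $\mathcal{S}$,

\[ H_{d_{\|\sup\|_\infty}}(\sigma, W_K) \le \dim(W_K) \left( C_W + \ln\frac{1}{\sigma} \right) \]

and

\[ H_{\max_k \sup_x \|\cdot\|_2}(\sigma, \Upsilon_K) \le \dim(\Upsilon_K) \left( C_\Upsilon + \ln\frac{1}{\sigma} \right). \]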
We can now state our main result:

Theorem 2. For any collection of Gaussian regression mixtures satisfying (K) and (DIM), there
is a constant $C$ such that for any $\rho \in (0, 1)$ and any $C_1 > 1$, there is a constant $\kappa_0$ depending only
on $\rho$ and $C_1$ such that, as soon as for every index $m \in \mathcal{M}$, $\mathrm{pen}(m) = \kappa((C + \ln n)\dim(S_m) + x_m)$
with $\kappa > \kappa_0$, the penalized likelihood estimate $\widehat{s}_{\widehat{m}}$ with $\widehat{m}$ such that

\[ \sum_{i=1}^{n} -\ln(\widehat{s}_{\widehat{m}}(Y_i|X_i)) + \mathrm{pen}(\widehat{m}) \le \inf_{m \in \mathcal{M}} \left( \sum_{i=1}^{n} -\ln(\widehat{s}_m(Y_i|X_i)) + \mathrm{pen}(m) \right) + \eta' \]

satisfies

\[ E\left[ JKL_\rho^{\otimes n}(s_0, \widehat{s}_{\widehat{m}}) \right] \le C_1 \inf_{m \in \mathcal{M}} \left( \inf_{s_m \in S_m} KL^{\otimes n}(s_0, s_m) + \frac{\mathrm{pen}(m)}{n} \right) + C_1 \frac{\kappa_0 \Sigma + \eta + \eta'}{n}. \]

In the previous theorem, the assumption on pen(m) could be replaced by the milder one

\[ \mathrm{pen}(m) \ge \kappa \left( 2 D_m C^2 + D_m \left( \ln\frac{n}{C^2 D_m} \right)_+ + x_m \right). \]

To minimize arbitrariness, $x_m$ should be chosen such that $2\kappa x_m$ is as small as possible. Notice
that the constant $C$ only depends on the model collection parameters, for instance on the maximal
number of components $K_{\max}$. As often in model selection, the collection may be chosen according
to the sample size $n$. If the constant $C$ grows no faster than $\ln(n)$, the penalty shape can be
kept intact and a similar result holds uniformly in $n$ up to a slightly larger $\kappa_0$. For instance, as
$K_{\max}$ only appears in $C$ through a logarithmic term, $K_{\max}$ may grow as a power of the sample
size.
We postpone the proof of this theorem to the Appendix and focus on Assumption (DIM).
This assumption can often be verified when the function sets $W_K$ and $\Upsilon_K$ are defined as images
of a finite dimensional compact subset of parameters when $\mathcal{X} = [0,1]^d$. For example, those
sets can be defined as linear combinations of a finite set of bounded functions whose coefficients
belong to a compact set. We study here the case of linear combinations of the first elements of
a polynomial basis, but similar results hold, up to some modification of the coefficient sets, for
many other choices (first elements of a Fourier, spline or wavelet basis, elements of an arbitrary
bounded dictionary...).

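Let $d_W$ and $d_\Upsilon$ be two integers and $T_W$ and $T_\Upsilon$ some positive numbers. We define

\[ W = \left\{ w : [0;1]^d \to \mathbb{R} \;\middle|\; w(x) = \sum_{|r|=0}^{d_W} \alpha_r x^r \text{ and } \|\alpha\|_\infty \le T_W \right\} \]

\[ \Upsilon = \left\{ \upsilon : [0;1]^d \to \mathbb{R}^p \;\middle|\; \forall j \in \{1, \ldots, p\},\ \forall x,\ \upsilon_j(x) = \sum_{|r|=0}^{d_\Upsilon} \alpha_r^{(j)} x^r \text{ and } \|\alpha\|_\infty \le T_\Upsilon \right\} \]

Let $W_K = \{0\} \otimes W^{K-1}$ and $\Upsilon_K = \Upsilon^K$.
We prove in Appendix that

Lemma 1. $W_K$ and $\Upsilon_K$ satisfy assumption (DIM), with $C_W = \ln\left( \sqrt{2} + T_W \binom{d_W + d}{d} \right)$ and
$C_\Upsilon = \ln\left( \sqrt{2} + \sqrt{p} \binom{d_\Upsilon + d}{d} T_\Upsilon \right)$, not depending on $K$.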

To apply Theorem 2, it remains to describe a collection $(S_m)$ and a suitable choice for
$(x_m)$. Assume, for instance, that the models in our collection are defined by an arbitrary
maximal number of components $K_{\max}$, a common free structure for the covariance matrix
$K$-tuple and a common maximal degree for the sets $W_K$ and $\Upsilon_K$, then one can verify that
$\dim(S_m) = (K - 1 + Kp)\binom{d_W + d}{d} + K\frac{p(p+1)}{2}$ and that the weight family $(x_m = K)$ satisfies
Assumption (K) with $\Sigma \le 1/(e - 1)$. Theorem 2 then yields an oracle inequality with $\mathrm{pen}(m) =
\kappa((C + \ln(n))\dim(S_m) + x_m)$. Note that as $x_m \ll (C + \ln(n))\dim(S_m)$, one can obtain a
similar oracle inequality with $\mathrm{pen}(m) = \kappa(C + \ln(n))\dim(S_m)$ for a slightly larger $\kappa$. Finally, as
explained in the proof, choosing a covariance structure from the finite collection of Celeux and
Govaert [5] or choosing the maximal degree for the sets $W_K$ and $\Upsilon_K$ among a finite family can
be obtained with the same penalty but with a larger constant $\Sigma$ in Assumption (K).
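
The dimension count and the Kraft sum of this example are easy to check numerically. The sketch below assumes a common maximal degree for weights and means, as in the text; the function names are hypothetical, and the values of `kappa` and `C` passed to `penalty` are user-chosen placeholders since the theory does not provide them explicitly.

```python
from math import comb, exp, log

def model_dimension(K, p, d, degree):
    """dim(S_m) = (K - 1 + K p) * binom(degree + d, d) + K p (p + 1) / 2."""
    n_coeffs = comb(degree + d, d)   # number of monomials of total degree <= degree in d variables
    return (K - 1 + K * p) * n_coeffs + K * p * (p + 1) // 2

def penalty(K, p, d, degree, n, kappa, C):
    """pen(m) = kappa * ((C + ln n) * dim(S_m) + x_m) with the weight x_m = K."""
    return kappa * ((C + log(n)) * model_dimension(K, p, d, degree) + K)

def kraft_sum(K_max):
    """sum_{K=1}^{K_max} exp(-x_m) with x_m = K; bounded by 1 / (e - 1)."""
    return sum(exp(-K) for K in range(1, K_max + 1))
```

For K = 2 components, a scalar response (p = 1), a scalar covariate (d = 1) and affine functions (degree 1), `model_dimension(2, 1, 1, 1)` gives 8: two weight coefficients, four mean coefficients and two variances.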

5 Numerical scheme and numerical experiment 

We illustrate our theoretical result in a setting similar to the one considered by Chamroukhi
et al. [6]. We observe $n$ pairs $(X_i, Y_i)$ with $X_i \in [0, 1]$ and $Y_i \in \mathbb{R}$ and look for the best estimate
of the conditional density $s_0(y|x)$ that can be written

\[ s_{K,\upsilon,\Sigma,w}(y|x) = \sum_{k=1}^{K} \pi_{w,k}(x) \Phi_{\upsilon_k(x),\Sigma_k}(y), \]

with $w \in W_K$ and $\upsilon \in \Upsilon_K$. We consider the simple case where $W_K$ and $\Upsilon_K$ comprise linear
functions. We do not impose any structure on the covariance matrices. Our aim is to estimate
the best number of components $K$, as well as the model parameters. As described in more
detail later, we use an EM type algorithm to estimate the model parameters for each $K$ and
select one using the penalized approach described previously.
Figure 1: Typical realizations. (a) 2 000 data points of example P. (b) 2 000 data points of example NP.



In our numerical experiment, we consider two different examples: one in which the true
conditional density belongs to one of our models, a parametric case, and one in which this is not true, a
non parametric case. In the first situation, we expect to perform almost as well as the maximum
likelihood estimation in the true model. In the second situation, we expect our algorithm to
automatically balance the model bias and its variance. More precisely, we let

\[ s_0(y|x) = \frac{1}{1 + \exp(15x - 7)} \Phi_{-15x+8,\,0.3}(y) + \frac{\exp(15x - 7)}{1 + \exp(15x - 7)} \Phi_{0.4x+0.6,\,0.4}(y) \]

in the first example, denoted example P, and

\[ s_0(y|x) = \frac{1}{1 + \exp(15x - 7)} \Phi_{15x^2-22x+7.4,\,0.3}(y) + \frac{\exp(15x - 7)}{1 + \exp(15x - 7)} \Phi_{-0.4x^2,\,0.4}(y) \]

in the second example, denoted example NP. For both experiments, we let $X$ be uniformly
distributed over $[0, 1]$. Figure 1 shows a typical realization for both examples.
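
A sample from example P can be drawn by first picking the component according to the logistic weight and then drawing from the corresponding Gaussian. The sketch below is an illustration, not the authors' code; it assumes that the second index of $\Phi$ (0.3 and 0.4) is a variance, in line with the convention that $\Phi_{\upsilon,\Sigma}$ is parametrized by its covariance.

```python
import numpy as np

def sample_example_p(n, rng):
    """Draw n pairs (X_i, Y_i) from example P; also return the hidden component labels."""
    x = rng.uniform(0.0, 1.0, size=n)
    # Weight of the second component: exp(15x - 7) / (1 + exp(15x - 7)).
    p_second = 1.0 / (1.0 + np.exp(-(15.0 * x - 7.0)))
    z = rng.uniform(size=n) < p_second            # True -> second component
    mean = np.where(z, 0.4 * x + 0.6, -15.0 * x + 8.0)
    var = np.where(z, 0.4, 0.3)                   # assumed to be variances
    y = mean + np.sqrt(var) * rng.standard_normal(n)
    return x, y, z

x, y, z = sample_example_p(2000, np.random.default_rng(0))
```

Example NP can be simulated the same way by replacing the two mean functions with $15x^2 - 22x + 7.4$ and $-0.4x^2$.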

As often in model selection approaches, the first step is to compute the maximum likelihood
estimate for each number of components $K$. To this purpose, we use a numerical scheme based
on the EM algorithm [10] similar to the one used by Chamroukhi et al. [6]. The only difference
with a classical EM is in the Maximization step since there is no closed formula for the weights
optimization. We use instead a Newton type algorithm. Note that we only perform a few Newton
steps (5 at most) and ensure that the likelihood does not decrease. We have noticed that there
is no need to fully optimize at each step: we did not observe a better convergence and the
algorithmic cost is high. We denote from now on this algorithm Newton-EM. Figure 2 illustrates
the fast convergence of this algorithm towards a local maximum of the likelihood. Notice that
the lower bound on the variance required in our theorem appears to be necessary in practice.
It avoids the spurious local maximizer issue of the EM algorithm, in which a class degenerates to a
minimal number of points allowing a perfect Gaussian regression fit. We use a lower bound of
$10/n$. Biernacki and Castellan [3] provide a more precise data-driven bound, expressed
with the chi-squared quantile function $\chi^2_{n-2K+1}$, which is of the same order as $1/n$ in our case. In
practice, the constant 10 gave good results.
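
The following sketch shows what one iteration of such a Newton-EM scheme could look like for a scalar covariate, a scalar response and affine weights and means. It is an illustrative reconstruction under these assumptions, not the authors' implementation: the function names, the parametrization (first weight function fixed to 0), the small ridge added to the Hessian and the rule "stop the Newton steps as soon as the weight objective decreases" are all choices made for this example.

```python
import numpy as np

def log_softmax(u):
    u = u - u.max(axis=1, keepdims=True)
    return u - np.log(np.exp(u).sum(axis=1, keepdims=True))

def newton_em_step(x, y, a, b, s2, var_min, newton_steps=5):
    """One EM iteration; returns updated parameters and the log-likelihood of the inputs.

    a: (K, 2) affine weight coefficients with a[0] kept at 0,
    b: (K, 2) affine mean coefficients, s2: (K,) variances.
    """
    n, K = len(x), len(s2)
    X = np.column_stack([np.ones(n), x])
    # E-step: posterior probabilities gamma[i, k] of each component.
    pi = np.exp(log_softmax(X @ a.T))
    dens = np.exp(-0.5 * (y[:, None] - X @ b.T) ** 2 / s2) / np.sqrt(2 * np.pi * s2)
    joint = pi * dens
    loglik = float(np.sum(np.log(joint.sum(axis=1))))
    gamma = joint / joint.sum(axis=1, keepdims=True)
    # M-step for means and variances: weighted least squares, closed form.
    b_new, s2_new = np.empty_like(b), np.empty_like(s2)
    for k in range(K):
        XtG = X.T * gamma[:, k]
        b_new[k] = np.linalg.solve(XtG @ X, XtG @ y)
        resid = y - X @ b_new[k]
        s2_new[k] = max(np.sum(gamma[:, k] * resid ** 2) / np.sum(gamma[:, k]), var_min)
    # M-step for weights: a few Newton steps on Q(a) = sum_i sum_k gamma_ik ln pi_k(x_i).
    def q(coefs):
        return np.sum(gamma * log_softmax(X @ coefs.T))
    a_new = a.copy()
    for _ in range(newton_steps):
        pi = np.exp(log_softmax(X @ a_new.T))
        grad = ((gamma - pi)[:, 1:].T @ X).ravel()
        H = np.zeros((2 * (K - 1), 2 * (K - 1)))
        for k in range(1, K):
            for l in range(1, K):
                wgt = pi[:, k] * ((k == l) - pi[:, l])
                H[2 * (k - 1):2 * k, 2 * (l - 1):2 * l] = -(X.T * wgt) @ X
        step = np.linalg.solve(H - 1e-8 * np.eye(H.shape[0]), grad)
        cand = a_new.copy()
        cand[1:] -= step.reshape(K - 1, 2)
        if q(cand) < q(a_new):    # reject a step that would decrease the objective
            break
        a_new = cand
    return a_new, b_new, s2_new, loglik
```

Iterating `newton_em_step` with `var_min = 10 / n` produces a non-decreasing sequence of log-likelihoods, since each block of the M-step never decreases the EM surrogate.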
Figure 2: Increase of the log-likelihood of the estimated density at each step of our iterative
Newton-EM algorithm in the example NP with 3 components and 2 000 data points.



An even more important issue with EM algorithms is initialization, since the local minimizer
obtained depends heavily on it. We observe that, while the weights $w$ do not require special
care and can simply be initialized uniformly equal to 0, the means require much more attention
in order to obtain a good minimizer. We propose an initialization strategy which can be seen as
an extension of a Quick-EM scheme with random initialization.

We randomly draw $K$ lines, each defined as the line going through two points $(X_i, Y_i)$ drawn
at random among the observations. We then perform a K-means clustering using the distance
along the Y axis. Our Newton-EM algorithm is initialized by the regression parameters as well
as the empirical variance on each of the $K$ clusters. We then perform 3 steps of our minimization
algorithm and keep among 50 trials the one with the largest likelihood. This winner is used as
the initialization of a final Newton-EM algorithm using 10 steps.
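
One trial of this initialization could be sketched as follows. This is an interpretation of the description above rather than the authors' code: "K-means clustering using the distance along the Y axis" is read here as alternately assigning each point to the line that is closest vertically and refitting each line by least squares on its cluster. The function name, the number of refits `n_iter` and the fallback variance for nearly empty clusters are assumptions.

```python
import numpy as np

def init_from_random_lines(x, y, K, rng, n_iter=10):
    """Random-line initialization followed by a K-means-like refinement along the Y axis."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    lines = np.empty((K, 2))                       # rows: (intercept, slope)
    for k in range(K):
        i, j = rng.choice(n, size=2, replace=False)
        while x[i] == x[j]:                        # avoid vertical lines
            i, j = rng.choice(n, size=2, replace=False)
        slope = (y[j] - y[i]) / (x[j] - x[i])
        lines[k] = (y[i] - slope * x[i], slope)
    for _ in range(n_iter):
        # Assign each observation to the closest line in vertical distance.
        labels = np.abs(y[:, None] - X @ lines.T).argmin(axis=1)
        for k in range(K):
            idx = labels == k
            if idx.sum() >= 3:                     # refit only clusters with enough points
                lines[k] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    variances = np.full(K, np.var(y))              # fallback for nearly empty clusters
    for k in range(K):
        idx = labels == k
        if idx.sum() >= 3:
            variances[k] = np.var(y[idx] - X[idx] @ lines[k])
    return lines, variances, labels
```

Repeating this for several random draws, running a few EM iterations from each and keeping the trial with the largest likelihood gives the multi-start strategy described above.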

We consider two other strategies: a naive one in which the initial lines chosen at random
and a common variance are used directly to initialize the Newton-EM algorithm and a clever
one in which observations are first normalized in order to have a similar variance along both
the X and the Y axis, a K-means on both X and Y with 5 times the number of components is
then performed and the initial lines are drawn among the regression lines of the resulting clusters
comprising more than 2 points.

The complexity of those procedures differs and, as stressed by Celeux and Govaert [5], the
fairest comparison is to perform them for the same amount of time (5 seconds, 30 seconds, 1
minute...) and compare the obtained likelihoods. The difference between the 3 strategies is not
dramatic: they yield very similar likelihoods. We nevertheless observe that the naive strategy has
a large dispersion and sometimes fails to give a satisfactory answer. Comparison between
the clever strategy and the regular one is more complex since the difference is much smaller.
Following Celeux and Govaert [5], we have chosen the regular one, which corresponds to more
random initializations and thus may explore more local maxima.

Once the parameters' estimates have been computed for each $K$, we select the model that
minimizes

\[ \sum_{i=1}^{n} -\ln(\widehat{s}_m(Y_i|X_i)) + \mathrm{pen}(m) \]

with $\mathrm{pen}(m) = \kappa \dim(S_m)$. Note that our theorem ensures that there exists a $\kappa$ large enough for
which the estimate has good properties, but does not give an explicit value for $\kappa$. In practice,
$\kappa$ has to be chosen. The two most classical choices are $\kappa = 1$ and $\kappa = \frac{\ln n}{2}$, which correspond
to the AIC and BIC approaches, motivated by asymptotic arguments. We have used here the
slope heuristic proposed by Birgé and Massart and described for instance in Baudry et al. [2]. It
consists in representing the dimension of the selected model according to $\kappa$ (Figure 3), and finding $\tilde{\kappa}$
such that if $\kappa < \tilde{\kappa}$, the dimension of the selected model is large, and reasonable otherwise. The
slope heuristic then prescribes the use of $\kappa = 2\tilde{\kappa}$. In both examples, we have noticed that the
sample size had no significant influence on the choice of $\kappa$, and that very often 1 was in the
range of possible values indicated by the slope heuristic. According to this observation, we have
chosen in both examples $\kappa = 1$.

Figure 3: Slope heuristic: plot of the selected model dimension with respect to the penalty
coefficient $\kappa$. In both examples, $\tilde{\kappa}$ is of order 1/2. (a) Example P with 2 000 points. (b) Example NP with 2 000 points.
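
A simple way to automate the dimension-jump reading of the slope heuristic is sketched below. It assumes the negative log-likelihood and the dimension of each fitted model are available; the function names and the rule "take the first grid value after the largest drop of the selected dimension" are choices made for this illustration, not a prescription from the paper.

```python
import numpy as np

def selected_dimension(neg_log_lik, dims, kappa):
    """Dimension of the model minimizing neg_log_lik + kappa * dim."""
    return dims[np.argmin(neg_log_lik + kappa * dims)]

def slope_heuristic(neg_log_lik, dims, kappas):
    """Locate the largest dimension jump along the kappa grid and return 2 * kappa_tilde."""
    neg_log_lik = np.asarray(neg_log_lik, dtype=float)
    dims = np.asarray(dims, dtype=float)
    kappas = np.asarray(kappas, dtype=float)
    sel = np.array([selected_dimension(neg_log_lik, dims, k) for k in kappas])
    jumps = sel[:-1] - sel[1:]            # drop of the selected dimension between grid points
    j = int(np.argmax(jumps))
    kappa_tilde = kappas[j + 1]           # first grid value after the largest drop
    return 2.0 * kappa_tilde, sel

# Synthetic check: beyond the second model the likelihood gains 0.5 per extra dimension,
# so the minimal penalty constant is 0.5.
dims = np.arange(1, 21) * 5 - 2
nll = 2000.0 + 300.0 * (dims < 8) - 0.5 * dims
kappa_hat, sel = slope_heuristic(nll, dims, np.linspace(0.05, 1.95, 20))
```

On this synthetic input the selected dimension drops from the largest model to the second one between $\kappa = 0.45$ and $\kappa = 0.55$, so the returned constant is about 1.1; the resolution of the $\kappa$ grid limits the precision.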

We measure performance in terms of the tensorized Kullback-Leibler distance. Since there is
no known formula for the tensorized Kullback-Leibler distance in the case of Gaussian mixtures,
and since we know the true density, we evaluate the distance using a Monte Carlo method. The
variability of this randomized evaluation has been verified to be negligible in practice.
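
Such a Monte Carlo evaluation can be sketched as follows: draw pairs $(X, Y)$ from the true model and average the log-ratio of the two conditional densities. The function name and the callable-based interface are assumptions of this example; the toy check uses two Gaussian regressions whose divergence is known in closed form.

```python
import numpy as np

def monte_carlo_kl(sample_xy, log_s0, log_t, n_mc, rng):
    """Estimate E_X[KL(s0(.|X), t(.|X))] by averaging ln s0(Y|X) - ln t(Y|X) over draws from s0."""
    x, y = sample_xy(n_mc, rng)
    return float(np.mean(log_s0(y, x) - log_t(y, x)))

# Toy check: s0(.|x) = N(x, 1) and t(.|x) = N(x + 1, 1), so the divergence is 1/2 for every x.
def sample_xy(n, rng):
    x = rng.uniform(size=n)
    return x, x + rng.standard_normal(n)

def log_gauss(y, mean):
    return -0.5 * (y - mean) ** 2 - 0.5 * np.log(2.0 * np.pi)

estimate = monte_carlo_kl(sample_xy,
                          lambda y, x: log_gauss(y, x),
                          lambda y, x: log_gauss(y, x + 1.0),
                          200_000, np.random.default_rng(0))
```

The standard error of the estimate decreases as the inverse square root of the number of draws, which is why a large Monte Carlo sample makes the variability negligible.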

For several numbers of mixture components and for the selected $K$, we draw in Figure 4 the box
plots and the mean of the tensorized Kullback-Leibler distance over 55 trials. The first observation
is that the mean of the tensorized Kullback-Leibler distance between the penalized estimator $\widehat{s}_{\widehat{K}}$
and $s_0$ is smaller than the mean of the tensorized Kullback-Leibler distance between $\widehat{s}_K$ and $s_0$ over
$K \in \{1, \ldots, 20\}$. This is in line with the oracle type inequality of Theorem 2. Our numerical
results hint that our theoretical analysis may be pessimistic. A close inspection shows that the
bias-variance trade-off differs between the two examples. Indeed, since in the first one the true
density belongs to the model, the best choice is $K = 2$ even for small $n$. As shown on the
histogram of Figure 5, this is almost always the model chosen by our algorithm. Observe also
that the mean of the Kullback-Leibler distance seems to behave like $\frac{\dim(S_m)}{2n}$ (shown by a dotted
line). This is indeed the expected behavior when the true model belongs to a nested collection
and corresponds to the classical AIC heuristic. In the second example, the true model does not
belong to the collection. The best choice for $K$ should thus balance a model approximation error
and a variance one. We observe in Figure 5 such a behavior: the larger $n$, the more complex the
model and thus the larger $K$. Note that the slope of the mean error seems also to grow like $\frac{\dim(S_m)}{2n}$ even
though there is no theoretical guarantee of such a behavior.

Figure [5] shows the error decay when the sample size n grows. As expected in the parametric 
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(c) Example NP with 2 000 data points 



(d) Example NP with 10 000 data points 



Figure 4: Box-plot of the Kullback-Leibler distance according to the number of mixture com- 
ponents. On each graph, the right-most box-plot shows this Kullback-Leibler distance for the 
penalized estimator sg- 
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(c) Example NP with 2 000 data points 

Figure 5: Histograms of the selected K 



Selected number of classes 

(d) Example NP with 10 000 data points 
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- - - linear regression of E[KLj 



(a) Example P. The slope of the free regression line is(b) Example NP. The slope of the regression line is ~ 
~-l,3 -0,6. 

Figure 6: Kullback-Leibler distance between the true density and the computed density using 
(Xi,Yi)i<N with respect to the sample size, represented in a log-log scale. For each graph, we 
added a free linear least-square regression and one with slope — 1 to stress the two different 
behavior. 



case, example P, we observe the decay in t/n predicted in the theory, with t some constant. 
The rate in the second case appears to be slower. Indeed, as the true conditional density does not belong to any model, the selected models become more and more complex as n grows, which slows the error decay. In our theoretical analysis, this can already be seen in the decay of the variance term of the oracle inequality. Indeed, if we let $m_0(n)$ be the optimal oracle model, the one minimizing the right-hand side of the oracle inequality, the variance term is of order

$$\frac{D_{m_0(n)}}{n},$$

which is larger than $\frac{1}{n}$ as soon as $D_{m_0(n)} \to +\infty$. It is well known that the decay depends on the regularity of the true conditional density. Providing a minimax analysis of the proposed estimator, as Maugis and Michel [171] have done, would be interesting but is beyond the scope of
this paper.
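As an illustration of how a decay rate can be read from a log-log plot, the following sketch fits a free least-squares line to log(error) against log(n). The function name `loglog_slope` and the synthetic error sequences are assumptions made for this example only; they are not the data of the experiments.

```python
import math

def loglog_slope(sample_sizes, errors):
    """Least-squares slope of log(error) regressed on log(sample size)."""
    xs = [math.log(n) for n in sample_sizes]
    ys = [math.log(e) for e in errors]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Synthetic (hypothetical) error curves: a parametric t/n decay and a slower one.
sizes = [2000, 4000, 6000, 8000, 10000]
parametric_errors = [3.0 / n for n in sizes]       # decay in t/n with t = 3
slower_errors = [0.5 * n ** -0.6 for n in sizes]   # slower polynomial decay

slope_parametric = loglog_slope(sizes, parametric_errors)
slope_slower = loglog_slope(sizes, slower_errors)
```

A slope close to -1 corresponds to the parametric behavior, while a slope closer to 0 indicates a slower decay.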



A Proof of Theorem [2] 

In this section, an overview of the proof of the model selection theorem, applied to our Gaussian regression mixture, is given. Appendix B is dedicated to the example with polynomial means and weights: the constants in Assumption (DIM) and in the theorem are specified. Then, in Appendix C, we provide more details on the proofs and lemmas used in the first section.

We will show that Assumption (DIM) ensures that for all $\delta \in [0;\sqrt{2}]$, $H_{[.],d^{\otimes n}}(\delta, S_m) \le D_m\left(C_m + \ln\frac{1}{\delta}\right)$ with a common $C_m$. If this happens, Proposition 1 yields the results. In other words, if we can control the models' bracketing entropy with a uniform constant $\mathfrak{C}$, we get a suitable bound on the complexity. This result will be obtained by first decomposing the entropy term between the weights and the Gaussian mixtures. To this end, we use the following distance over conditional densities:



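$$\sup_x d_y(s,t) = \left( \sup_{x \in \mathcal{X}} \int_{\mathcal{Y}} \left( \sqrt{s(y|x)} - \sqrt{t(y|x)} \right)^2 dy \right)^{1/2}.$$

Notice that $d^{2 \otimes n}(s,t) \le \sup_x d_y^2(s,t)$.
For all weights $\pi$ and $\pi'$, we define

$$\sup_x d_k(\pi,\pi') = \left( \sup_{x \in \mathcal{X}} \sum_{k=1}^K \left( \sqrt{\pi_k(x)} - \sqrt{\pi'_k(x)} \right)^2 \right)^{1/2}.$$

Finally, for all densities $s$ and $t$ over $\mathcal{Y}$, depending on $x$, we set

$$\sup_x \max_k d_y(s,t) = \sup_{x \in \mathcal{X}} \max_{1 \le k \le K} d_y(s_k(x,.), t_k(x,.)) = \sup_{x \in \mathcal{X}} \max_{1 \le k \le K} \left( \int_{\mathcal{Y}} \left( \sqrt{s_k(x,y)} - \sqrt{t_k(x,y)} \right)^2 dy \right)^{1/2}.$$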



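Lemma 2. Let $\mathcal{P} = \left\{ \pi = (\pi_{w,k})_{1 \le k \le K} \,/\, w \in W_K, \text{ and } \forall (k,x),\ \pi_{w,k}(x) = \frac{e^{w_k(x)}}{\sum_{l=1}^K e^{w_l(x)}} \right\}$ and $\mathcal{G} = \left\{ (\Phi_{v_k,\Sigma_k})_{1 \le k \le K} \,/\, v \in \Upsilon_K,\ \Sigma \in V_K \right\}$. Then for all $\delta$ in $[0;\sqrt{2}]$, for all $m$ in $\mathcal{M}$,

$$H_{[.],\sup_x d_y}(\delta, S_m) \le H_{[.],\sup_x d_k}\left(\frac{\delta}{5}, \mathcal{P}\right) + H_{[.],\sup_x \max_k d_y}\left(\frac{\delta}{5}, \mathcal{G}\right).$$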

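One can then relate the bracketing entropy of $\mathcal{P}$ to the entropy of $W_K$.
Lemma 3. For all $\delta \in [0;\sqrt{2}]$,

$$H_{[.],\sup_x d_k}\left(\frac{\delta}{5}, \mathcal{P}\right) \le H_{\max_k \|.\|_\infty}\left(\frac{3\sqrt{3}\,\delta}{20\sqrt{K}}, W_K\right).$$

Since $\mathcal{P}$ is a set of weights, $\frac{3\sqrt{3}\,\delta}{20\sqrt{K}}$ could be replaced by $\frac{3\sqrt{3}\,\delta}{20\sqrt{K-1}}$ with an identifiability condition. For example, $W'_K = \{(0, w_2 - w_1, \ldots, w_K - w_1) \,|\, w \in W_K\}$ can be covered using brackets of null size on the first coordinate, lowering the squared Hellinger distance between the brackets' bounds to a sum of $K - 1$ terms. Therefore, $H_{[.],\sup_x d_k}\left(\frac{\delta}{5}, \mathcal{P}\right) \le H_{\max_k \|.\|_\infty}\left(\frac{3\sqrt{3}\,\delta}{20\sqrt{K-1}}, W'_K\right)$.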

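Since we have assumed that $\exists D_{W_K}, C_W$ s.t. $\forall \delta \in [0;\sqrt{2}]$,

$$H_{\max_k \|.\|_\infty}(\delta, W_K) \le D_{W_K}\left(C_W + \ln\frac{1}{\delta}\right),$$

then

$$H_{[.],\sup_x d_k}\left(\frac{\delta}{5}, \mathcal{P}\right) \le D_{W_K}\left(C_W + \ln\left(\frac{20\sqrt{K}}{3\sqrt{3}\,\delta}\right)\right).$$
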
To tackle the Gaussian regression part, we rely heavily on the following proposition, 
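Proposition 2. Let $\kappa \ge \frac{17}{29}$, $\gamma_\kappa = \frac{25(\kappa - \frac{1}{2})}{49(1 + \frac{2\kappa}{5})}$. For any $0 < \delta \le \sqrt{2}$ and any $\delta_\Sigma \le \frac{1}{5\sqrt{\kappa^2 \cosh(\frac{2\kappa}{5}) + \frac{1}{2}}} \frac{\delta}{p}$, let $(v, L, A, D) \in \Upsilon \times [L_-, L_+] \times \mathcal{A}(\lambda_-, \lambda_+) \times SO(p)$ and $(\tilde{v}, \tilde{L}, \tilde{A}, \tilde{D}) \in \Upsilon \times [L_-, L_+] \times \mathcal{A}(\lambda_-, +\infty) \times SO(p)$, $\Sigma = L D A D'$ and $\tilde{\Sigma} = \tilde{L} \tilde{D} \tilde{A} \tilde{D}'$, and define

$$t^-(x,y) = (1 + \kappa\delta_\Sigma)^{-p}\, \Phi_{\tilde{v}(x),(1+\delta_\Sigma)^{-1}\tilde{\Sigma}}(y) \quad \text{and} \quad t^+(x,y) = (1 + \kappa\delta_\Sigma)^{p}\, \Phi_{\tilde{v}(x),(1+\delta_\Sigma)\tilde{\Sigma}}(y).$$

If

$$\begin{cases}
\forall x \in \mathcal{X},\ \|v(x) - \tilde{v}(x)\|^2 \le p\,\gamma_\kappa L_- \lambda_- \frac{\lambda_-}{\lambda_+} \delta_\Sigma^2 \\
\left(1 + \frac{2}{25}\delta_\Sigma\right)^{-1} \tilde{L} \le L \le \tilde{L} \\
\forall 1 \le i \le p,\ |A_{i,i}^{-1} - \tilde{A}_{i,i}^{-1}| \le \frac{1}{10} \frac{\delta_\Sigma}{\lambda_+} \\
\forall y \in \mathbb{R}^p,\ \|Dy - \tilde{D}y\| \le \frac{1}{10} \frac{\lambda_-}{\lambda_+} \delta_\Sigma \|y\|
\end{cases}$$

then $[t^-, t^+]$ is a $\frac{\delta}{5}$ Hellinger bracket such that $t^-(x,y) \le \Phi_{v(x),\Sigma}(y) \le t^+(x,y)$.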

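We consider three cases: the parameter (mean, volume, matrix) is known ($\star = 0$), unknown but common to all classes ($\star = c$), unknown and possibly different for every class ($\star = K$). For example, $[v_K, L_0, D_c, A_0]$ denotes a model in which only means are free and eigenvector matrices are assumed to be equal and unknown. Under our assumption that $\exists D_{\Upsilon_K}, C_\Upsilon$ s.t. $\forall \delta \in [0;\sqrt{2}]$,

$$H_{\max_k \sup_x \|.\|_2}(\delta, \Upsilon_K) \le D_{\Upsilon_K}\left(C_\Upsilon + \ln\frac{1}{\delta}\right),$$

we deduce:

$$H_{[.],\sup_x \max_k d_y}\left(\frac{\delta}{5}, \mathcal{G}\right) \le \mathcal{D}\left(C + \ln\frac{1}{\delta}\right),$$

where

$$\begin{aligned}
\mathcal{D} &= Z_{v,\star} + Z_{L,\star} + \frac{p(p-1)}{2} Z_{D,\star} + (p-1) Z_{A,\star}, \\
C &= \ln\left(5p\sqrt{\kappa^2\cosh\left(\tfrac{2\kappa}{5}\right) + \tfrac{1}{2}}\right) + \frac{Z_{v,\star} C_\Upsilon}{\mathcal{D}} + \frac{Z_{v,\star}}{2\mathcal{D}} \ln\left(\frac{\lambda_+}{p\gamma_\kappa L_- \lambda_-^2}\right) + \frac{Z_{L,\star}}{\mathcal{D}} \ln\left(\frac{4 + 129\ln\left(\frac{L_+}{L_-}\right)}{10}\right) \\
&\quad + \frac{Z_{D,\star}}{\mathcal{D}} \left(\ln(c_U) + \frac{p(p-1)}{2} \ln\left(\frac{10\lambda_+}{\lambda_-}\right)\right) + \frac{Z_{A,\star}(p-1)}{\mathcal{D}} \ln\left(\frac{4}{5} + \frac{52\lambda_+}{5\lambda_-} \ln\left(\frac{\lambda_+}{\lambda_-}\right)\right), \\
Z_{v,K} &= D_{\Upsilon_K},\quad Z_{v,c} = D_{\Upsilon_1},\quad Z_{v,0} = 0, \\
Z_{L,0} &= Z_{D,0} = Z_{A,0} = 0, \\
Z_{L,c} &= Z_{D,c} = Z_{A,c} = 1, \\
Z_{L,K} &= Z_{D,K} = Z_{A,K} = K.
\end{aligned} \tag{1}$$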



We notice that the following upper bound of $C$ is independent of the model of the collection,
because we have made this hypothesis on $C_\Upsilon$.



C < In I 5p\ k 2 cosh 
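We conclude that $H_{[.],\sup_x d_y}(\delta, S_m) \le D_m\left(C_m + \ln\frac{1}{\delta}\right)$, with

$$D_m = D_{W_K} + \mathcal{D},$$

$$C_m = \frac{D_{W_K}}{D_m}\left(C_W + \ln\left(\frac{20\sqrt{K}}{3\sqrt{3}}\right)\right) + \frac{\mathcal{D}\, C}{D_m} \le C_W + \ln\left(\frac{20\sqrt{K_{\max}}}{3\sqrt{3}}\right) + \mathfrak{C}_1 := \mathfrak{C},$$

where $\mathfrak{C}_1$ denotes the upper bound of $C$ given above.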



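Note that the constant $\mathfrak{C}$ does not depend on the dimension $D_m$ of the model, thanks to the hypothesis that $C_W$ is common for every model $S_m$ in the collection. Using Proposition 1, we thus deduce that

$$n\sigma_m^2 \le D_m \left( 2\left(\sqrt{\mathfrak{C}} + \sqrt{\pi}\right)^2 + \left( \ln \frac{n}{\left(\sqrt{\mathfrak{C}} + \sqrt{\pi}\right)^2 D_m} \right)_+ \right).$$

Theorem 1 then yields, for a collection $\mathcal{S} = (S_m)_{m \in \mathcal{M}}$ with $\mathcal{M} = \{(K, W_K, \Upsilon_K, V_K) \,|\, K \in \mathbb{N}^*,\ W_K, \Upsilon_K, V_K \text{ as previously defined}\}$ for which Assumption (K) holds, the oracle inequality of Theorem 2 as soon as

$$\mathrm{pen}(m) \ge \kappa \left( D_m \left( 2\left(\sqrt{\mathfrak{C}} + \sqrt{\pi}\right)^2 + \left( \ln \frac{n}{\left(\sqrt{\mathfrak{C}} + \sqrt{\pi}\right)^2 D_m} \right)_+ \right) + x_m \right).$$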



B Proof of Theorem for polynomial 

We focus here on the example in which $W_K$ and $\Upsilon_K$ are polynomials of degree respectively at
most $d_W$ and $d_\Upsilon$.

By applying lemmas [TJ [3] and [TJ we get: 

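Corollary 1.

$$H_{[.],\sup_x d_k}\left(\frac{\delta}{5}, \mathcal{P}\right) \le (K-1)\binom{d_W + d}{d} \left[ \ln\left(\sqrt{2} + \frac{20}{3\sqrt{3}} T_W \sqrt{K-1} \binom{d_W + d}{d}\right) + \ln\frac{1}{\delta} \right] \le D_{W_K}\left( C_W + \ln\left(\frac{20}{3\sqrt{3}}\sqrt{K-1}\right) + \ln\frac{1}{\delta} \right),$$

$$H_{[.],\sup_x \max_k d_y}\left(\frac{\delta}{5}, \mathcal{G}\right) \le \mathcal{D}\left( C + \ln\frac{1}{\delta} \right),$$

with

$$\mathcal{D} = D_{\Upsilon_K} + K\frac{p(p+1)}{2}, \qquad D_{\Upsilon_K} = pK\binom{d_\Upsilon + d}{d},$$

$$\begin{aligned}
C &= \frac{2 D_{\Upsilon_K}}{2 D_{\Upsilon_K} + Kp(p+1)} \left( C_\Upsilon + \frac{1}{2}\ln\left(\frac{\lambda_+}{p\gamma_\kappa L_- \lambda_-^2}\right) \right) + \ln\left(5p\sqrt{\kappa^2\cosh\left(\tfrac{2\kappa}{5}\right) + \tfrac{1}{2}}\right) \\
&\quad + \frac{2K}{2 D_{\Upsilon_K} + Kp(p+1)} \left( \ln(c_U) + \ln\left(\frac{4 + 129\ln\left(\frac{L_+}{L_-}\right)}{10}\right) + \frac{p(p-1)}{2}\ln\left(\frac{10\lambda_+}{\lambda_-}\right) + (p-1)\ln\left(\frac{4}{5} + \frac{52\lambda_+}{5\lambda_-}\ln\left(\frac{\lambda_+}{\lambda_-}\right)\right) \right).
\end{aligned}$$
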

Just like in the general case, we define C± by: 



2 I 7k^-A1 



+ 



P(P + 1) 



p+1 



]n(cu) + In 



+ P _i ln fiox + 



A. 



4 + 129 In (-^± 
10 



In I 5pWK 2 cosh I — - 



/4 52A +1 
^-Ws + IaTNa 




and remind that $\mathfrak{C} = C_W + \ln\left(\frac{20\sqrt{K_{\max}}}{3\sqrt{3}}\right) + \mathfrak{C}_1$ is an upper bound for $C_m$.

We recall that $C_W = \ln\left(\sqrt{2} + T_W\binom{d_W + d}{d}\right)$ and $C_\Upsilon = \ln\left(\sqrt{2} + \sqrt{p}\, T_\Upsilon \binom{d_\Upsilon + d}{d}\right)$, and observe that $\mathfrak{C}$ does not depend on the model $S_m$ in the collection since $\mathfrak{C}$ only depends on $K_{\max}$, $T_W$, $d_W$, $T_\Upsilon$, $d_\Upsilon$, $p$, $d$, $\kappa$ and the parameters defining $V_K$. Then we can apply the result in the general case to the collection $(S_m)$ in which each model is defined by a number of components $K$, a common free structure on the covariance matrix $K$-tuple and a common maximal degree for the sets $W_K$ and $\Upsilon_K$. The family $(x_m = K)_{m \in \mathcal{M}}$ satisfies Kraft inequality, since $\sum_{m \in \mathcal{M}} e^{-x_m} \le \frac{1}{e-1}$. We obtain an oracle inequality with $\mathrm{pen}(m) = \kappa\left((C + \ln(n)) \dim(S_m) + x_m\right)$, where $C = 2\left(\sqrt{\mathfrak{C}} + \sqrt{\pi}\right)^2$, $\dim(S_m) = (K - 1 + Kp)\binom{d_W + d}{d} + K\frac{p(p+1)}{2}$ and $x_m = K$ for the selection of the number of components in the mixture. If we change the structure $V_K$ over the covariance matrices, it only changes the constant in Kraft inequality, since there is a finite number of possible structures for a fixed $K$ and the sum $\sum_{m \in \mathcal{M}} e^{-x_m}$ can be rewritten $\sum_{K \in \mathbb{N}^*} \sum_{m \in \mathcal{M} \,|\, m(1) = K} e^{-x_m}$.
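To make the shape of this penalty concrete, here is a minimal sketch that evaluates the model dimension and the penalty for the polynomial example with free covariance structure, and checks the Kraft sum numerically. The function names and the numerical values of `kappa` and `C` are hypothetical; the theory only provides lower bounds on these constants.

```python
import math

def model_dimension(K, p, d, degree):
    """dim(S_m) = (K - 1 + K p) * binom(degree + d, d) + K p (p + 1) / 2."""
    n_monomials = math.comb(degree + d, d)  # monomials of total degree <= degree in d variables
    return (K - 1 + K * p) * n_monomials + K * p * (p + 1) // 2

def penalty(K, p, d, degree, n, kappa, C):
    """pen(m) = kappa * ((C + ln n) * dim(S_m) + x_m), with x_m = K."""
    return kappa * ((C + math.log(n)) * model_dimension(K, p, d, degree) + K)

def kraft_sum(K_max):
    """Partial sum of exp(-x_m) with x_m = K, for K = 1..K_max."""
    return sum(math.exp(-K) for K in range(1, K_max + 1))

# Hypothetical setting: scalar response, one covariate, affine means and weights.
dim_two_components = model_dimension(K=2, p=1, d=1, degree=1)
pen_two = penalty(K=2, p=1, d=1, degree=1, n=2000, kappa=1.0, C=1.0)
pen_three = penalty(K=3, p=1, d=1, degree=1, n=2000, kappa=1.0, C=1.0)
```

In this setting the dimension counts 2 coefficients for the single free weight function, 4 for the two affine means and 2 variances.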



C Lemma Proofs 

In this section, we provide the proofs of the main lemmas used in the first appendix to prove
Theorem 2. It begins with the decomposition of the bracketing entropy; then we focus on the bracketing
entropy of the weights' families in the general case and in our example, followed by the analysis
of the bracketing entropy of Gaussian families.
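C.1 Bracketing entropy's decomposition
Lemma 4. Let

$$\mathcal{P} = \left\{ \pi = (\pi_k)_{1 \le k \le K} \,/\, \forall k,\ \pi_k : \mathcal{X} \to \mathbb{R}_+ \text{ and } \forall x \in \mathcal{X},\ \sum_{k=1}^K \pi_k(x) = 1 \right\},$$

$$\Psi = \left\{ (\psi_1, \ldots, \psi_K) \,/\, \forall k,\ \psi_k : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+, \text{ and } \forall x, \forall k,\ \int \psi_k(x,y)\,dy = 1 \right\},$$

$$\mathcal{C} = \left\{ (x,y) \mapsto \sum_{k=1}^K \pi_k(x)\psi_k(x,y) \,/\, \pi \in \mathcal{P},\ \psi \in \Psi \right\}.$$

Then for all $\delta$ in $[0;\sqrt{2}]$,

$$H_{[.],\sup_x d_y}(\delta, \mathcal{C}) \le H_{[.],\sup_x d_k}\left(\frac{\delta}{5}, \mathcal{P}\right) + H_{[.],\sup_x \max_k d_y}\left(\frac{\delta}{5}, \Psi\right).$$

The proof mimics the one of Lemma 7 from [9].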



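Proof. First we will exhibit a covering of brackets of $\mathcal{C}$.

Let $([\pi^{i,-}, \pi^{i,+}])_{1 \le i \le N_\mathcal{P}}$ be a minimal covering of $\delta$ brackets for $\sup_x d_k$ of $\mathcal{P}$:

$$\forall i \in \{1, \ldots, N_\mathcal{P}\}, \forall x \in \mathcal{X},\ d_k(\pi^{i,-}(x), \pi^{i,+}(x)) \le \delta.$$

Let $([\psi^{i,-}, \psi^{i,+}])_{1 \le i \le N_\Psi}$ be a minimal covering of $\delta$ brackets for $\sup_x \max_k d_y$ of $\Psi$: $\forall i \in \{1, \ldots, N_\Psi\}, \forall x \in \mathcal{X}, \forall k \in \{1, \ldots, K\},\ d_y(\psi_k^{i,-}(x,.), \psi_k^{i,+}(x,.)) \le \delta$. Let $s$ be a density in $\mathcal{C}$. By definition, there is $\pi$ in $\mathcal{P}$ and $\psi$ in $\Psi$ such that for all $(x,y)$ in $\mathcal{X} \times \mathcal{Y}$, $s(y|x) = \sum_{k=1}^K \pi_k(x)\psi_k(x,y)$.
Due to the covering, there is $i$ in $\{1, \ldots, N_\mathcal{P}\}$ such that

$$\forall x \in \mathcal{X}, \forall k \in \{1, \ldots, K\},\ \pi_k^{i,-}(x) \le \pi_k(x) \le \pi_k^{i,+}(x).$$

There is also $j$ in $\{1, \ldots, N_\Psi\}$ such that

$$\forall x \in \mathcal{X}, \forall k \in \{1, \ldots, K\}, \forall y \in \mathcal{Y},\ \psi_k^{j,-}(x,y) \le \psi_k(x,y) \le \psi_k^{j,+}(x,y).$$

Since for all $x$, for all $k$ and for all $y$, $\pi_k(x)$ and $\psi_k(x,y)$ are non-negative, we may multiply term-by-term and sum these inequalities over $k$ to obtain:

$$\forall x \in \mathcal{X}, \forall y \in \mathcal{Y},\ \sum_{k=1}^K \left(\pi_k^{i,-}(x)\right)_+ \left(\psi_k^{j,-}(x,y)\right)_+ \le s(y|x) \le \sum_{k=1}^K \pi_k^{i,+}(x)\psi_k^{j,+}(x,y).$$

$$\left( \left[ \sum_{k=1}^K \left(\pi_k^{i,-}\right)_+ \left(\psi_k^{j,-}\right)_+ ,\ \sum_{k=1}^K \pi_k^{i,+}\psi_k^{j,+} \right] \right)_{\substack{1 \le i \le N_\mathcal{P} \\ 1 \le j \le N_\Psi}}$$

is thus a bracket covering of $\mathcal{C}$.
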

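Now, we focus on the brackets' size using lemmas from [9] (namely Lemmas 11, 12 and 13). To lighten the notations, $\pi_k^-$ and $\psi_k^-$ are supposed non-negative for all $k$. Following their Lemma 12, only using Cauchy-Schwarz inequality, we prove that

$$\sup_x d_y^2\left( \sum_{k=1}^K \pi_k^-(x)\psi_k^-(x,.),\ \sum_{k=1}^K \pi_k^+(x)\psi_k^+(x,.) \right) \le \sup_x d_{yk}^2\left(\pi^-(x)\psi^-(x,.),\ \pi^+(x)\psi^+(x,.)\right),$$

where $d_{yk}^2(\pi^-\psi^-, \pi^+\psi^+) = \int \sum_{k=1}^K \left(\sqrt{\pi_k^+\psi_k^+} - \sqrt{\pi_k^-\psi_k^-}\right)^2 dy$.

Then, using Cauchy-Schwarz inequality again, we get by their Lemma 11:

$$\sup_x d_{yk}^2\left(\pi^-(x)\psi^-(x,.),\ \pi^+(x)\psi^+(x,.)\right) \le \sup_x \left( \max_k d_y\left(\psi_k^+(x,.), \psi_k^-(x,.)\right) \sqrt{\sum_{k=1}^K \pi_k^+(x)} + d_k\left(\pi^+(x), \pi^-(x)\right) \max_k \sqrt{\int \psi_k^-(x,y)\,dy} \right)^2.$$

According to their Lemma 13, $\forall x$, $\sum_{k=1}^K \pi_k^+(x) \le 1 + 2(\sqrt{2} + \sqrt{3})\delta$. Hence

$$\sup_x \left( \max_k d_y\left(\psi_k^+(x,.), \psi_k^-(x,.)\right) \sqrt{\sum_{k=1}^K \pi_k^+(x)} + d_k\left(\pi^+(x), \pi^-(x)\right) \max_k \sqrt{\int \psi_k^-(x,y)\,dy} \right)^2 \le \left( \sqrt{1 + 2(\sqrt{2} + \sqrt{3})\delta} + 1 \right)^2 \delta^2 \le (5\delta)^2.$$

The result follows from the fact that we exhibited a $5\delta$ covering of brackets of $\mathcal{C}$, with cardinality $N_\mathcal{P} N_\Psi$. □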
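The last numerical step, namely that the factor in front of $\delta^2$ stays below 25 on $[0;\sqrt{2}]$, can be checked with a short sketch. The function names are hypothetical and only serve this illustration.

```python
import math

def bracket_factor(delta):
    """Factor (sqrt(1 + 2 (sqrt(2) + sqrt(3)) delta) + 1)^2 multiplying delta^2."""
    return (math.sqrt(1.0 + 2.0 * (math.sqrt(2.0) + math.sqrt(3.0)) * delta) + 1.0) ** 2

def worst_factor(n_points=1000):
    """Largest factor over a regular grid of [0, sqrt(2)]."""
    top = math.sqrt(2.0)
    return max(bracket_factor(top * i / n_points) for i in range(n_points + 1))

largest = worst_factor()
```

The factor is increasing in $\delta$, so its maximum is reached at $\delta = \sqrt{2}$, where it equals $(1 + \sqrt{2} + \sqrt{3})^2$, comfortably below 25.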

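C.2 Bracketing entropy of weights' families
C.2.1 When $W_K$ is compact

We demonstrate that for any $\delta \in [0;\sqrt{2}]$,

$$H_{[.],\sup_x d_k}\left(\frac{\delta}{5}, \mathcal{P}\right) \le H_{\max_k \|.\|_\infty}\left(\frac{3\sqrt{3}\,\delta}{20\sqrt{K}}, W_K\right).$$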

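Proof. We show that $\forall (w,z) \in (W_K)^2, \forall k \in \{1, \ldots, K\}, \forall x \in \mathcal{X}$, $|\sqrt{\pi_{w,k}(x)} - \sqrt{\pi_{z,k}(x)}| \le F(k,x)\, d(w,z)$, with $F$ a function and $d$ some distance. We define $\forall k, \forall u \in \mathbb{R}^K$, $A_k(u) = \frac{\exp(u_k)}{\sum_{l=1}^K \exp(u_l)}$, so $\pi_{w,k}(x) = A_k(w(x))$.
$\forall (u,v) \in (\mathbb{R}^K)^2$,

$$\left|\sqrt{A_k(v)} - \sqrt{A_k(u)}\right| = \left| \int_0^1 \nabla\left(\sqrt{A_k}\right)(u + t(v - u)).(v - u)\,dt \right|.$$

Besides,

$$\nabla\left(\sqrt{A_k}\right)(u) = \left( \frac{1}{2}\sqrt{A_k(u)}\, \frac{\partial}{\partial u_l}\left(\ln(A_k(u))\right) \right)_{1 \le l \le K} = \left( \frac{1}{2}\sqrt{A_k(u)}\left(\delta_{k,l} - A_l(u)\right) \right)_{1 \le l \le K}.$$

Hence

$$\begin{aligned}
\left|\sqrt{A_k(v)} - \sqrt{A_k(u)}\right| &= \frac{1}{2}\left| \int_0^1 \sqrt{A_k(u + t(v-u))} \sum_{l=1}^K \left(\delta_{k,l} - A_l(u + t(v-u))\right)(v_l - u_l)\,dt \right| \\
&\le \frac{1}{2} \int_0^1 \sqrt{A_k(u + t(v-u))} \sum_{l=1}^K \left|\delta_{k,l} - A_l(u + t(v-u))\right| |v_l - u_l|\,dt \\
&\le \frac{\|v - u\|_\infty}{2} \int_0^1 \sqrt{A_k(u + t(v-u))} \sum_{l=1}^K \left|\delta_{k,l} - A_l(u + t(v-u))\right| dt.
\end{aligned}$$

Since $\forall u \in \mathbb{R}^K$, $\sum_{l=1}^K A_l(u) = 1$, we have $\sum_{l=1}^K |\delta_{k,l} - A_l(u)| = 2(1 - A_k(u))$, and

$$\left|\sqrt{A_k(v)} - \sqrt{A_k(u)}\right| \le \|v - u\|_\infty \int_0^1 \sqrt{A_k(u + t(v-u))}\left(1 - A_k(u + t(v-u))\right) dt \le \frac{2}{3\sqrt{3}} \|v - u\|_\infty$$

since $x \mapsto \sqrt{x}(1 - x)$ is maximal over $[0;1]$ for $x = \frac{1}{3}$. We deduce that for any $(w,z)$ in $(W_K)^2$, for all $k$ in $\{1, \ldots, K\}$, for any $x$ in $\mathcal{X}$, $|\sqrt{\pi_{w,k}(x)} - \sqrt{\pi_{z,k}(x)}| \le \frac{2}{3\sqrt{3}} \max_l \|w_l - z_l\|_\infty$.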

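By hypothesis, for any positive $\epsilon$, an $\epsilon$-net $\mathcal{N}$ of $W_K$ may be exhibited. Let $w$ be an element of $W_K$. There is a $z$ belonging to the $\epsilon$-net $\mathcal{N}$ such that $\max_l \|z_l - w_l\|_\infty \le \epsilon$. Since for all $k$ in $\{1, \ldots, K\}$, for any $x$ in $\mathcal{X}$,

$$\left|\sqrt{\pi_{w,k}(x)} - \sqrt{\pi_{z,k}(x)}\right| \le \frac{2}{3\sqrt{3}} \max_l \|w_l - z_l\|_\infty \le \frac{2}{3\sqrt{3}}\epsilon$$

and

$$\sum_{k=1}^K \left( \sqrt{\pi_{z,k}(x)} + \frac{2}{3\sqrt{3}}\epsilon - \sqrt{\pi_{z,k}(x)} + \frac{2}{3\sqrt{3}}\epsilon \right)^2 = K\left(\frac{4}{3\sqrt{3}}\epsilon\right)^2,$$

$$\left( \left[ \left(\sqrt{\pi_z} - \frac{2}{3\sqrt{3}}\epsilon\right)^2,\ \left(\sqrt{\pi_z} + \frac{2}{3\sqrt{3}}\epsilon\right)^2 \right] \right)_{z \in \mathcal{N}}$$

is a $\frac{4\epsilon\sqrt{K}}{3\sqrt{3}}$-bracketing cover of $\mathcal{P}$. As a result, $H_{[.],\sup_x d_k}\left(\frac{\delta}{5}, \mathcal{P}\right) \le H_{\max_k \|.\|_\infty}\left(\frac{3\sqrt{3}\,\delta}{20\sqrt{K}}, W_K\right)$. □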
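The Lipschitz bound at the heart of this proof, $|\sqrt{A_k(v)} - \sqrt{A_k(u)}| \le \frac{2}{3\sqrt{3}}\|v - u\|_\infty$ for the logistic (softmax) map $A$, can be probed numerically. The sketch below draws random pairs of vectors; the function names, the sampling range and the number of trials are arbitrary choices made for this illustration.

```python
import math
import random

def softmax(u):
    """Logistic weights A_k(u) = exp(u_k) / sum_l exp(u_l), computed stably."""
    shift = max(u)
    exps = [math.exp(x - shift) for x in u]
    total = sum(exps)
    return [e / total for e in exps]

# Constant of the proof: maximum of sqrt(x) * (1 - x) over [0, 1], reached at x = 1/3.
LIPSCHITZ = 2.0 / (3.0 * math.sqrt(3.0))

def max_ratio(K, trials, seed=0):
    """Largest observed |sqrt(A_k(v)) - sqrt(A_k(u))| / ||v - u||_inf over random pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        u = [rng.uniform(-3.0, 3.0) for _ in range(K)]
        v = [rng.uniform(-3.0, 3.0) for _ in range(K)]
        sup_norm = max(abs(a - b) for a, b in zip(u, v))
        if sup_norm == 0.0:
            continue
        gap = max(abs(math.sqrt(a) - math.sqrt(b))
                  for a, b in zip(softmax(u), softmax(v)))
        worst = max(worst, gap / sup_norm)
    return worst
```

Such a random search cannot prove the bound, but an observed ratio above `LIPSCHITZ` would contradict it.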



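C.2.2 When $W_K = \{0\} \otimes W^{K-1}$ with $W$ a set of polynomials

We remind that

$$W := \left\{ w : [0;1]^d \to \mathbb{R} \,/\, w(x) = \sum_{|r|=0}^{d_W} \alpha_r x^r \text{ and } \|\alpha\|_\infty \le T_W \right\}.$$

Proposition 3. For all $\delta \in [0;\sqrt{2}]$,

$$H_{[.],\sup_x d_k}\left(\frac{\delta}{5}, \mathcal{P}\right) \le (K-1)\binom{d_W + d}{d} \left[ \ln\left(\sqrt{2} + \frac{20}{3\sqrt{3}} T_W \sqrt{K-1} \binom{d_W + d}{d}\right) + \ln\frac{1}{\delta} \right].$$

Proof. $W_K$ is a finite dimensional compact set. Thanks to the result in the general case, we get

$$\begin{aligned}
H_{[.],\sup_x d_k}\left(\frac{\delta}{5}, \mathcal{P}\right) &\le H_{\max_k \|.\|_\infty}\left(\frac{3\sqrt{3}\,\delta}{20\sqrt{K-1}}, W_K\right) \\
&\le H_{\|.\|_\infty}\left(\frac{3\sqrt{3}\,\delta}{20\sqrt{K-1}\binom{d_W + d}{d}}, \left\{ \alpha \in \mathbb{R}^{(K-1)\binom{d_W + d}{d}} \,/\, \|\alpha\|_\infty \le T_W \right\}\right) \\
&\le (K-1)\binom{d_W + d}{d} \ln\left(1 + \frac{20\sqrt{K-1}\, T_W \binom{d_W + d}{d}}{3\sqrt{3}\,\delta}\right) \\
&\le (K-1)\binom{d_W + d}{d} \left[ \ln\left(\sqrt{2} + \frac{20}{3\sqrt{3}} T_W \sqrt{K-1} \binom{d_W + d}{d}\right) + \ln\frac{1}{\delta} \right].
\end{aligned}$$

The second inequality comes from: for all $w$, $v$ in $W_K$,

$$\max_k \|w_k - v_k\|_\infty \le \max_k \sum_{|r|=0}^{d_W} |\alpha_{k,r} - \beta_{k,r}| \le \binom{d_W + d}{d} \max_{k,r} |\alpha_{k,r} - \beta_{k,r}|.$$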

C.3 Bracketing entropy of Gaussian families 
C.3.1 General case 

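We rely on a general construction of Gaussian brackets:

Proposition 4. Let $\kappa \ge \frac{17}{29}$, $\gamma_\kappa = \frac{25(\kappa - \frac{1}{2})}{49(1 + \frac{2\kappa}{5})}$. For any $0 < \delta \le \sqrt{2}$, any $p \ge 1$ and any $\delta_\Sigma \le \frac{1}{5\sqrt{\kappa^2 \cosh(\frac{2\kappa}{5}) + \frac{1}{2}}} \frac{\delta}{p}$, let $(v, L, A, D) \in \Upsilon \times [L_-, L_+] \times \mathcal{A}(\lambda_-, \lambda_+) \times SO(p)$ and $(\tilde{v}, \tilde{L}, \tilde{A}, \tilde{D}) \in \Upsilon \times [L_-, L_+] \times \mathcal{A}(\lambda_-, +\infty) \times SO(p)$, define $\Sigma = L D A D'$ and $\tilde{\Sigma} = \tilde{L} \tilde{D} \tilde{A} \tilde{D}'$,

$$t^-(x,y) = (1 + \kappa\delta_\Sigma)^{-p}\, \Phi_{\tilde{v}(x),(1+\delta_\Sigma)^{-1}\tilde{\Sigma}}(y) \quad \text{and} \quad t^+(x,y) = (1 + \kappa\delta_\Sigma)^{p}\, \Phi_{\tilde{v}(x),(1+\delta_\Sigma)\tilde{\Sigma}}(y).$$

If

$$\begin{cases}
\forall x \in \mathcal{X},\ \|v(x) - \tilde{v}(x)\|^2 \le p\,\gamma_\kappa L_- \lambda_- \frac{\lambda_-}{\lambda_+} \delta_\Sigma^2 \\
\left(1 + \frac{2}{25}\delta_\Sigma\right)^{-1} \tilde{L} \le L \le \tilde{L} \\
\forall 1 \le i \le p,\ |A_{i,i}^{-1} - \tilde{A}_{i,i}^{-1}| \le \frac{1}{10} \frac{\delta_\Sigma}{\lambda_+} \\
\forall y \in \mathbb{R}^p,\ \|Dy - \tilde{D}y\| \le \frac{1}{10} \frac{\lambda_-}{\lambda_+} \delta_\Sigma \|y\|
\end{cases}$$

then $[t^-, t^+]$ is a $\delta/5$ Hellinger bracket such that $t^-(x,y) \le \Phi_{v(x),\Sigma}(y) \le t^+(x,y)$.

Admitting this proposition, we are brought to construct nets over the spaces of the means, the volumes, the eigenvector matrices and the normalized eigenvalue matrices. We consider three cases: the parameter (mean, volume, matrix) is known ($\star = 0$), unknown but common to all classes ($\star = c$), unknown and possibly different for every class ($\star = K$). For example, $[v_K, L_0, D_c, A_0]$ denotes a model in which only means are free and eigenvector matrices are assumed to be equal and unknown.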
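If the means are free ($\star = K$), we construct a grid $G_{\Upsilon_K}$ over $\Upsilon_K$, which is compact. Since

$$H_{\max_k \sup_x \|.\|_2}\left( \sqrt{p\gamma_\kappa L_- \lambda_- \frac{\lambda_-}{\lambda_+}}\,\delta_\Sigma, \Upsilon_K \right) \le D_{\Upsilon_K}\left( C_\Upsilon + \ln\left( \frac{1}{\sqrt{p\gamma_\kappa L_- \lambda_- \frac{\lambda_-}{\lambda_+}}\,\delta_\Sigma} \right) \right),$$

$$\left| G_{\Upsilon_K}\left( \sqrt{p\gamma_\kappa L_- \lambda_- \frac{\lambda_-}{\lambda_+}}\,\delta_\Sigma \right) \right| \le \left( \frac{\exp(C_\Upsilon)}{\sqrt{p\gamma_\kappa L_- \lambda_- \frac{\lambda_-}{\lambda_+}}\,\delta_\Sigma} \right)^{D_{\Upsilon_K}}.$$

If the means are common and unknown ($\star = c$), belonging to $\Upsilon_1$, we construct a grid $G_{\Upsilon_c}\left( \sqrt{p\gamma_\kappa L_- \lambda_- \frac{\lambda_-}{\lambda_+}}\,\delta_\Sigma \right)$ over $\Upsilon_1$ with cardinality at most

$$\left( \frac{\exp(C_\Upsilon)}{\sqrt{p\gamma_\kappa L_- \lambda_- \frac{\lambda_-}{\lambda_+}}\,\delta_\Sigma} \right)^{D_{\Upsilon_1}}.$$

Finally, if the means are known ($\star = 0$), we do not need to construct a grid. In the end,

$$\left| G_{\Upsilon_\star}\left( \sqrt{p\gamma_\kappa L_- \lambda_- \frac{\lambda_-}{\lambda_+}}\,\delta_\Sigma \right) \right| \le \left( \frac{\exp(C_\Upsilon)}{\sqrt{p\gamma_\kappa L_- \lambda_- \frac{\lambda_-}{\lambda_+}}\,\delta_\Sigma} \right)^{Z_{v,\star}}$$

with $Z_{v,K} = D_{\Upsilon_K}$, $Z_{v,c} = D_{\Upsilon_1}$ and $Z_{v,0} = 0$.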

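Then, we consider the grid $G_L$ over $[L_-, L_+]$:

$$\left| G_L\left(\frac{2}{25}\delta_\Sigma\right) \right| \le 1 + \frac{\ln\left(\frac{L_+}{L_-}\right)}{\ln\left(1 + \frac{2}{25}\delta_\Sigma\right)}.$$

Since $\delta_\Sigma \le \frac{2}{5}$, $\ln\left(1 + \frac{2}{25}\delta_\Sigma\right) \ge \frac{10}{129}\delta_\Sigma$, and

$$\left| G_L\left(\frac{2}{25}\delta_\Sigma\right) \right| \le 1 + \frac{129\ln\left(\frac{L_+}{L_-}\right)}{10\,\delta_\Sigma} \le \frac{4 + 129\ln\left(\frac{L_+}{L_-}\right)}{10\,\delta_\Sigma}.$$

By definition of a net, for any $D \in SO(p)$ there is a $\tilde{D} \in G_D\left(\frac{1}{10}\frac{\lambda_-}{\lambda_+}\delta_\Sigma\right)$ such that $\forall y \in \mathbb{R}^p$, $\|Dy - \tilde{D}y\| \le \frac{1}{10}\frac{\lambda_-}{\lambda_+}\delta_\Sigma\|y\|$. There exists a universal constant $c_U$ such that

$$\left| G_D\left(\frac{1}{10}\frac{\lambda_-}{\lambda_+}\delta_\Sigma\right) \right| \le c_U \left( \frac{10\lambda_+}{\lambda_-\delta_\Sigma} \right)^{\frac{p(p-1)}{2}}.$$

For the grid $G_A$, we look at the condition on the $p - 1$ first diagonal values and obtain:

$$\left| G_A\left(\frac{1}{10}\frac{\lambda_-}{\lambda_+}\delta_\Sigma\right) \right| \le \left( 2 + \frac{\ln\left(\frac{\lambda_+}{\lambda_-}\right)}{\ln\left(1 + \frac{1}{10}\frac{\lambda_-}{\lambda_+}\delta_\Sigma\right)} \right)^{p-1}.$$

Since $\delta_\Sigma \le \frac{2}{5}$, $\ln\left(1 + \frac{1}{10}\frac{\lambda_-}{\lambda_+}\delta_\Sigma\right) \ge \frac{5}{52}\frac{\lambda_-}{\lambda_+}\delta_\Sigma$, then

$$\left| G_A\left(\frac{1}{10}\frac{\lambda_-}{\lambda_+}\delta_\Sigma\right) \right| \le \left( \frac{\frac{4}{5} + \frac{52\lambda_+}{5\lambda_-}\ln\left(\frac{\lambda_+}{\lambda_-}\right)}{\delta_\Sigma} \right)^{p-1}.$$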
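Let $Z_{L,0} = Z_{D,0} = Z_{A,0} = 0$, $Z_{L,c} = Z_{D,c} = Z_{A,c} = 1$, $Z_{L,K} = Z_{D,K} = Z_{A,K} = K$. We define $f_{v,\star}$ from $\Upsilon_\star$ into $\Upsilon_K$ by

$$f_{v,\star} : \begin{cases}
0 \mapsto (v_{0,1}, \ldots, v_{0,K}) & \text{if } \star = 0 \\
v \mapsto (v, \ldots, v) & \text{if } \star = c \\
(v_1, \ldots, v_K) \mapsto (v_1, \ldots, v_K) & \text{if } \star = K
\end{cases}$$

and similarly $f_{L,\star}$, $f_{D,\star}$ and $f_{A,\star}$, respectively from $(\mathbb{R}_+)^{Z_{L,\star}}$ into $(\mathbb{R}_+)^K$, from $(SO(p))^{Z_{D,\star}}$ into $(SO(p))^K$ and from $\mathcal{A}(\lambda_-,\lambda_+)^{Z_{A,\star}}$ into $\mathcal{A}(\lambda_-,\lambda_+)^K$.
We define

$$\Gamma : (v_1, \ldots, v_K, L_1, \ldots, L_K, D_1, \ldots, D_K, A_1, \ldots, A_K) \mapsto (v_k, L_k D_k A_k D_k')_{1 \le k \le K}$$

and $\Psi : (v_k, \Sigma_k)_{1 \le k \le K} \mapsto (\Phi_{v_k,\Sigma_k})_{1 \le k \le K}$. The image of $\Upsilon_\star \times [L_-, L_+]^{Z_{L,\star}} \times SO(p)^{Z_{D,\star}} \times \mathcal{A}(\lambda_-,\lambda_+)^{Z_{A,\star}}$ by $\Psi \circ \Gamma \circ (f_{v,\star} \otimes f_{L,\star} \otimes f_{D,\star} \otimes f_{A,\star})$ is the set $\mathcal{G}$ of all $K$-tuples of Gaussian densities of type $[v_\star, L_\star, D_\star, A_\star]$.
Now, we define $B$:

$$(v_k, \Sigma_k)_{1 \le k \le K} \mapsto \left( (1 + \kappa\delta_\Sigma)^{-p}\,\Phi_{v_k,(1+\delta_\Sigma)^{-1}\Sigma_k},\ (1 + \kappa\delta_\Sigma)^{p}\,\Phi_{v_k,(1+\delta_\Sigma)\Sigma_k} \right)_{1 \le k \le K}.$$

The image of $G_{\Upsilon_\star} \times G_L^{Z_{L,\star}} \times G_D^{Z_{D,\star}} \times G_A^{Z_{A,\star}}$ by $B \circ \Gamma \circ (f_{v,\star} \otimes f_{L,\star} \otimes f_{D,\star} \otimes f_{A,\star})$ is a $\delta/5$-bracket covering of $\mathcal{G}$, with cardinality bounded by

$$\left( \frac{\exp(C_\Upsilon)}{\sqrt{p\gamma_\kappa L_- \lambda_-^2 \lambda_+^{-1}}\,\delta_\Sigma} \right)^{Z_{v,\star}} \times \left( \frac{4 + 129\ln\left(\frac{L_+}{L_-}\right)}{10\,\delta_\Sigma} \right)^{Z_{L,\star}} \times \left( c_U \left( \frac{10\lambda_+}{\lambda_-\delta_\Sigma} \right)^{\frac{p(p-1)}{2}} \right)^{Z_{D,\star}} \times \left( \frac{\frac{4}{5} + \frac{52\lambda_+}{5\lambda_-}\ln\left(\frac{\lambda_+}{\lambda_-}\right)}{\delta_\Sigma} \right)^{(p-1)Z_{A,\star}}.$$

Taking $\delta_\Sigma = \frac{1}{5\sqrt{\kappa^2\cosh(\frac{2\kappa}{5}) + \frac{1}{2}}}\frac{\delta}{p}$, we obtain

$$H_{[.],\sup_x \max_k d_y}\left(\frac{\delta}{5}, \mathcal{G}\right) \le \mathcal{D}\left( C + \ln\frac{1}{\delta} \right)$$

with $\mathcal{D} = Z_{v,\star} + Z_{L,\star} + \frac{p(p-1)}{2} Z_{D,\star} + (p-1) Z_{A,\star}$ and

$$\begin{aligned}
C &= \ln\left(5p\sqrt{\kappa^2\cosh\left(\tfrac{2\kappa}{5}\right) + \tfrac{1}{2}}\right) + \frac{Z_{v,\star} C_\Upsilon}{\mathcal{D}} + \frac{Z_{v,\star}}{2\mathcal{D}} \ln\left(\frac{\lambda_+}{p\gamma_\kappa L_- \lambda_-^2}\right) \\
&\quad + \frac{Z_{L,\star}}{\mathcal{D}} \ln\left(\frac{4 + 129\ln\left(\frac{L_+}{L_-}\right)}{10}\right) + \frac{Z_{D,\star}}{\mathcal{D}} \left(\ln(c_U) + \frac{p(p-1)}{2} \ln\left(\frac{10\lambda_+}{\lambda_-}\right)\right) \\
&\quad + \frac{Z_{A,\star}(p-1)}{\mathcal{D}} \ln\left(\frac{4}{5} + \frac{52\lambda_+}{5\lambda_-} \ln\left(\frac{\lambda_+}{\lambda_-}\right)\right).
\end{aligned}$$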

C.3.2 With polynomial means 

Using previous work, we only have to handle the bracketing entropy of $\Upsilon_K$. Just like for $W_K$, we aim
at bounding the bracketing entropy by the entropy of the parameters' space.



RR n° 8281 




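We focus on the example where $\Upsilon_K = \Upsilon^K$ and

$$\Upsilon = \left\{ v : [0;1]^d \to \mathbb{R}^p \,/\, \forall j \in \{1, \ldots, p\}, \forall x,\ v_j(x) = \sum_{|r|=0}^{d_\Upsilon} \alpha_r^{(j)} x^r \text{ and } \|\alpha\|_\infty \le T_\Upsilon \right\}.$$

We consider for any $v$, $\tilde{v}$ in $\Upsilon$ and any $x$ in $[0;1]^d$,

$$\begin{aligned}
\|v(x) - \tilde{v}(x)\|_2^2 &= \sum_{j=1}^p \left( \sum_{|r|=0}^{d_\Upsilon} \left(\alpha_r^{(j)} - \beta_r^{(j)}\right) x^r \right)^2 \\
&\le \sum_{j=1}^p \left( \sum_{|r|=0}^{d_\Upsilon} \left(\alpha_r^{(j)} - \beta_r^{(j)}\right)^2 \right) \left( \sum_{|r|=0}^{d_\Upsilon} x^{2r} \right) \\
&\le \binom{d_\Upsilon + d}{d} \sum_{j=1}^p \sum_{|r|=0}^{d_\Upsilon} \left(\alpha_r^{(j)} - \beta_r^{(j)}\right)^2 \\
&\le p \binom{d_\Upsilon + d}{d}^2 \max_{j,r} \left(\alpha_r^{(j)} - \beta_r^{(j)}\right)^2.
\end{aligned}$$

So,

$$\begin{aligned}
H_{\max_k \sup_x \|.\|_2}\left(\delta, \Upsilon^K\right) &\le H_{\max_{k,j,r}|.|}\left( \frac{\delta}{\sqrt{p}\binom{d_\Upsilon + d}{d}}, \left\{ \left(\alpha_r^{(j,k)}\right)_{\substack{1 \le j \le p \\ |r| \le d_\Upsilon \\ 1 \le k \le K}} \,/\, \|\alpha\|_\infty \le T_\Upsilon \right\} \right) \\
&\le pK\binom{d_\Upsilon + d}{d} \ln\left( 1 + \frac{\sqrt{p}\binom{d_\Upsilon + d}{d} T_\Upsilon}{\delta} \right) \\
&\le D_{\Upsilon_K}\left( C_\Upsilon + \ln\frac{1}{\delta} \right)
\end{aligned}$$

with $D_{\Upsilon_K} = pK\binom{d_\Upsilon + d}{d}$ and $C_\Upsilon = \ln\left( \sqrt{2} + \sqrt{p}\binom{d_\Upsilon + d}{d} T_\Upsilon \right)$.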



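C.4 Proof of the key proposition to handle bracketing entropy of Gaussian families

C.4.1 Proof of Proposition 4

Proof. $[t^-, t^+]$ is a $\delta/5$ bracket.

Since $(1 + \delta_\Sigma)\tilde{\Sigma}^{-1} - (1 + \delta_\Sigma)^{-1}\tilde{\Sigma}^{-1} = \left((1 + \delta_\Sigma) - (1 + \delta_\Sigma)^{-1}\right)\tilde{\Sigma}^{-1}$ is a positive-definite matrix, Maugis and Michel's lemma can be applied.

Lemma 5. ([171]) Let $\Phi_{v_1,\Sigma_1}$ and $\Phi_{v_2,\Sigma_2}$ be two Gaussian densities with full rank covariance matrix in dimension $p$ such that $\Sigma_1^{-1} - \Sigma_2^{-1}$ is a positive definite matrix. For any $y \in \mathbb{R}^p$,

$$\frac{\Phi_{v_1,\Sigma_1}(y)}{\Phi_{v_2,\Sigma_2}(y)} \le \sqrt{\frac{|\Sigma_2|}{|\Sigma_1|}} \exp\left( \frac{1}{2}(v_1 - v_2)'(\Sigma_2 - \Sigma_1)^{-1}(v_1 - v_2) \right).$$

Thus, $\forall x \in \mathcal{X}, \forall y \in \mathbb{R}^p$,

$$\frac{t^-(x,y)}{t^+(x,y)} = \frac{(1 + \kappa\delta_\Sigma)^{-p}\,\Phi_{\tilde{v}(x),(1+\delta_\Sigma)^{-1}\tilde{\Sigma}}(y)}{(1 + \kappa\delta_\Sigma)^{p}\,\Phi_{\tilde{v}(x),(1+\delta_\Sigma)\tilde{\Sigma}}(y)} \le \frac{1}{(1 + \kappa\delta_\Sigma)^{2p}} \sqrt{\frac{|(1 + \delta_\Sigma)\tilde{\Sigma}|}{|(1 + \delta_\Sigma)^{-1}\tilde{\Sigma}|}} = \left( \frac{1 + \delta_\Sigma}{(1 + \kappa\delta_\Sigma)^2} \right)^p \le 1.$$

For all $x$ in $\mathcal{X}$,

$$\begin{aligned}
d_y^2(t^-, t^+) &= \int t^-(x,y)\,dy + \int t^+(x,y)\,dy - 2\int \sqrt{t^-(x,y)}\sqrt{t^+(x,y)}\,dy \\
&= (1 + \kappa\delta_\Sigma)^{-p} + (1 + \kappa\delta_\Sigma)^{p} - 2(1 + \kappa\delta_\Sigma)^{-p/2}(1 + \kappa\delta_\Sigma)^{p/2} \int \sqrt{\Phi_{\tilde{v}(x),(1+\delta_\Sigma)^{-1}\tilde{\Sigma}}(y)}\sqrt{\Phi_{\tilde{v}(x),(1+\delta_\Sigma)\tilde{\Sigma}}(y)}\,dy \\
&= (1 + \kappa\delta_\Sigma)^{-p} + (1 + \kappa\delta_\Sigma)^{p} - \left( 2 - d_y^2\left( \Phi_{\tilde{v}(x),(1+\delta_\Sigma)^{-1}\tilde{\Sigma}},\ \Phi_{\tilde{v}(x),(1+\delta_\Sigma)\tilde{\Sigma}} \right) \right).
\end{aligned}$$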

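Using the following lemma,

Lemma 6. Let $\Phi_{v_1,\Sigma_1}$ and $\Phi_{v_2,\Sigma_2}$ be two Gaussian densities with full rank covariance matrix in dimension $p$, then

$$d^2\left(\Phi_{v_1,\Sigma_1}, \Phi_{v_2,\Sigma_2}\right) = 2\left( 1 - 2^{p/2}\,|\Sigma_1\Sigma_2|^{-1/4}\,|\Sigma_1^{-1} + \Sigma_2^{-1}|^{-1/2} \exp\left( -\frac{1}{4}(v_1 - v_2)'(\Sigma_1 + \Sigma_2)^{-1}(v_1 - v_2) \right) \right),$$

we obtain

$$\begin{aligned}
d_y^2(t^-, t^+) &= (1 + \kappa\delta_\Sigma)^{-p} + (1 + \kappa\delta_\Sigma)^{p} - 2^{p/2+1}\left( (1 + \delta_\Sigma) + (1 + \delta_\Sigma)^{-1} \right)^{-p/2} \\
&= 2 - 2^{p/2+1}\left( (1 + \delta_\Sigma) + (1 + \delta_\Sigma)^{-1} \right)^{-p/2} + (1 + \kappa\delta_\Sigma)^{-p} - 2 + (1 + \kappa\delta_\Sigma)^{p}.
\end{aligned}$$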
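The closed form of Lemma 6 can be sanity-checked in dimension $p = 1$ by comparing it with a direct numerical integration of $(\sqrt{\Phi_1} - \sqrt{\Phi_2})^2$. This is only a sketch: the function names, the trapezoidal rule and the integration window are choices made for this illustration.

```python
import math

def gaussian_pdf(y, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def hellinger_sq_closed_form(m1, var1, m2, var2):
    """Lemma 6 in dimension p = 1 (determinants reduce to the variances)."""
    p = 1
    affinity = (2.0 ** (p / 2)
                * (var1 * var2) ** -0.25
                * (1.0 / var1 + 1.0 / var2) ** -0.5
                * math.exp(-0.25 * (m1 - m2) ** 2 / (var1 + var2)))
    return 2.0 * (1.0 - affinity)

def hellinger_sq_numeric(m1, var1, m2, var2, steps=20000):
    """Trapezoidal approximation of the integral of (sqrt(phi1) - sqrt(phi2))^2."""
    spread = 12.0 * math.sqrt(max(var1, var2))
    lo, hi = min(m1, m2) - spread, max(m1, m2) + spread
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        y = lo + i * h
        diff = math.sqrt(gaussian_pdf(y, m1, var1)) - math.sqrt(gaussian_pdf(y, m2, var2))
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * diff * diff
    return total * h
```

With equal means and variances the closed form returns 0, and in general it stays in $[0, 2]$, as expected for a squared Hellinger distance with this normalization.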

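Applying Lemma 7

Lemma 7. For any $0 < \delta \le \sqrt{2}$ and any $p \ge 1$, let $\kappa \ge \frac{1}{2}$ and $\delta_\Sigma \le \frac{1}{5\sqrt{\kappa^2\cosh(\frac{2\kappa}{5}) + \frac{1}{2}}}\frac{\delta}{p}$, then

$$\delta_\Sigma \le \frac{2}{5p} \le \frac{2}{5}.$$

and

Lemma 8. For any $p \in \mathbb{N}^*$, for any $\delta_\Sigma > 0$,

$$2 - 2^{p/2+1}\left( (1 + \delta_\Sigma) + (1 + \delta_\Sigma)^{-1} \right)^{-p/2} \le \frac{p\,\delta_\Sigma^2}{2} \le \frac{p^2\delta_\Sigma^2}{2}.$$

Furthermore, if $p\,\delta_\Sigma \le c$, then

$$(1 + \kappa\delta_\Sigma)^{p} + (1 + \kappa\delta_\Sigma)^{-p} - 2 \le \kappa^2\cosh(\kappa c)\,p^2\delta_\Sigma^2.$$

with $c = \frac{2}{5}$, it comes out that:

$$\sup_x d_y^2\left(t^-(x,.), t^+(x,.)\right) \le \left( \kappa^2\cosh\left(\frac{2\kappa}{5}\right) + \frac{1}{2} \right) p^2\delta_\Sigma^2 \le \left(\frac{\delta}{5}\right)^2.$$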

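Now, we show that for all $x$ in $\mathcal{X}$, for all $y$ in $\mathbb{R}^p$, $t^-(x,y) \le \Phi_{v(x),\Sigma}(y) \le t^+(x,y)$. To do so, we use Lemma 5, thanks to the hypothesis made on covariance matrices.

Lemma 9. Let $(L, A, D) \in [L_-, L_+] \times \mathcal{A}(\lambda_-, \lambda_+) \times SO(p)$ and $(\tilde{L}, \tilde{A}, \tilde{D}) \in [L_-, L_+] \times \mathcal{A}(\lambda_-, \infty) \times SO(p)$, define $\Sigma = LDAD'$ and $\tilde{\Sigma} = \tilde{L}\tilde{D}\tilde{A}\tilde{D}'$. If

$$\begin{cases}
(1 + \delta_L)^{-1}\tilde{L} \le L \le \tilde{L} \\
\forall 1 \le i \le p,\ |A_{i,i}^{-1} - \tilde{A}_{i,i}^{-1}| \le \delta_A\lambda_-^{-1} \\
\forall y \in \mathbb{R}^p,\ \|Dy - \tilde{D}y\| \le \delta_D\|y\|
\end{cases}$$

then $(1 + \delta_\Sigma)\tilde{\Sigma}^{-1} - \Sigma^{-1}$ and $\Sigma^{-1} - (1 + \delta_\Sigma)^{-1}\tilde{\Sigma}^{-1}$ satisfy

$$\forall y \in \mathbb{R}^p,\ y'\left( (1 + \delta_\Sigma)\tilde{\Sigma}^{-1} - \Sigma^{-1} \right) y \ge \tilde{L}^{-1}\left( (\delta_\Sigma - \delta_L)\lambda_+^{-1} - (1 + \delta_\Sigma)\lambda_-^{-1}(2\delta_D + \delta_A) \right)\|y\|^2,$$

$$\forall y \in \mathbb{R}^p,\ y'\left( \Sigma^{-1} - (1 + \delta_\Sigma)^{-1}\tilde{\Sigma}^{-1} \right) y \ge \frac{\tilde{L}^{-1}}{1 + \delta_\Sigma}\left( \delta_\Sigma\lambda_+^{-1} - \lambda_-^{-1}(2\delta_D + \delta_A) \right)\|y\|^2.$$

Using

$$\delta_L = \frac{2}{25}\delta_\Sigma, \qquad \delta_D = \delta_A = \frac{1}{10}\frac{\lambda_-}{\lambda_+}\delta_\Sigma,$$

we get lower bounds of the same order:

$$\forall y \in \mathbb{R}^p,\ y'\left( (1 + \delta_\Sigma)\tilde{\Sigma}^{-1} - \Sigma^{-1} \right) y \ge \frac{\delta_\Sigma}{2\tilde{L}\lambda_+}\|y\|^2,$$

$$\forall y \in \mathbb{R}^p,\ y'\left( \Sigma^{-1} - (1 + \delta_\Sigma)^{-1}\tilde{\Sigma}^{-1} \right) y \ge \frac{7}{10}\frac{\delta_\Sigma}{(1 + \delta_\Sigma)\tilde{L}\lambda_+}\|y\|^2.$$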

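Let's compare $\Phi_{v,\Sigma}$ and $t^+$.

$$\begin{aligned}
\frac{\Phi_{v(x),\Sigma}(y)}{(1 + \kappa\delta_\Sigma)^p\,\Phi_{\tilde{v}(x),(1+\delta_\Sigma)\tilde{\Sigma}}(y)} &\le (1 + \kappa\delta_\Sigma)^{-p}\sqrt{\frac{|(1 + \delta_\Sigma)\tilde{\Sigma}|}{|\Sigma|}} \exp\left( \frac{1}{2}(v(x) - \tilde{v}(x))'\left( (1 + \delta_\Sigma)\tilde{\Sigma} - \Sigma \right)^{-1}(v(x) - \tilde{v}(x)) \right) \\
&= \frac{(1 + \delta_\Sigma)^{p/2}}{(1 + \kappa\delta_\Sigma)^p}\sqrt{\frac{|\tilde{\Sigma}|}{|\Sigma|}} \exp\left( \frac{1}{2}(v(x) - \tilde{v}(x))'\left( (1 + \delta_\Sigma)\tilde{\Sigma} - \Sigma \right)^{-1}(v(x) - \tilde{v}(x)) \right).
\end{aligned}$$

But,

$$\left( (1 + \delta_\Sigma)\tilde{\Sigma} - \Sigma \right)^{-1} = \left( (1 + \delta_\Sigma)\tilde{\Sigma}\left( \Sigma^{-1} - (1 + \delta_\Sigma)^{-1}\tilde{\Sigma}^{-1} \right)\Sigma \right)^{-1} = (1 + \delta_\Sigma)^{-1}\Sigma^{-1}\left( \Sigma^{-1} - (1 + \delta_\Sigma)^{-1}\tilde{\Sigma}^{-1} \right)^{-1}\tilde{\Sigma}^{-1}.$$

Thus by Lemma 9,

$$\begin{aligned}
(v(x) - \tilde{v}(x))'\left( (1 + \delta_\Sigma)\tilde{\Sigma} - \Sigma \right)^{-1}(v(x) - \tilde{v}(x)) &\le (1 + \delta_\Sigma)^{-1}L^{-1}\lambda_-^{-1}\,\frac{10}{7}(1 + \delta_\Sigma)\tilde{L}\lambda_+\delta_\Sigma^{-1}\,\tilde{L}^{-1}\lambda_-^{-1}\|v(x) - \tilde{v}(x)\|^2 \\
&\le \frac{10}{7}L^{-1}\lambda_-^{-2}\lambda_+\delta_\Sigma^{-1}\|v(x) - \tilde{v}(x)\|^2 \\
&\le \frac{10}{7}L^{-1}\lambda_-^{-2}\lambda_+\delta_\Sigma^{-1}\,p\gamma_\kappa L_-\lambda_-^{2}\lambda_+^{-1}\delta_\Sigma^2 \\
&\le \frac{10}{7}p\gamma_\kappa\delta_\Sigma.
\end{aligned}$$

Since $\sqrt{\frac{|\tilde{\Sigma}|}{|\Sigma|}} \le \left(1 + \frac{2}{25}\delta_\Sigma\right)^{p/2}$,

$$\frac{\Phi_{v(x),\Sigma}(y)}{(1 + \kappa\delta_\Sigma)^p\,\Phi_{\tilde{v}(x),(1+\delta_\Sigma)\tilde{\Sigma}}(y)} \le \frac{(1 + \delta_\Sigma)^{p/2}\left(1 + \frac{2}{25}\delta_\Sigma\right)^{p/2}}{(1 + \kappa\delta_\Sigma)^p} \exp\left( \frac{5}{7}p\gamma_\kappa\delta_\Sigma \right).$$

It suffices that

$$\frac{5\gamma_\kappa}{7}\delta_\Sigma \le \ln\left( \frac{1 + \kappa\delta_\Sigma}{\sqrt{1 + \delta_\Sigma}\sqrt{1 + \frac{2}{25}\delta_\Sigma}} \right).$$

Now let

$$f(\delta_\Sigma) = \ln(1 + \kappa\delta_\Sigma) - \frac{1}{2}\ln(1 + \delta_\Sigma) - \frac{1}{2}\ln\left(1 + \frac{2}{25}\delta_\Sigma\right),$$

$$f'(\delta_\Sigma) = \frac{\kappa}{1 + \kappa\delta_\Sigma} - \frac{1}{2(1 + \delta_\Sigma)} - \frac{1}{25\left(1 + \frac{2}{25}\delta_\Sigma\right)} = \frac{(27\kappa - 4)\delta_\Sigma + 50\kappa - 27}{2(1 + \kappa\delta_\Sigma)(1 + \delta_\Sigma)(25 + 2\delta_\Sigma)}.$$

Since $\kappa \ge \frac{17}{29} > \frac{27}{50}$,

$$f'(\delta_\Sigma) > \frac{\kappa - \frac{27}{50}}{(1 + \kappa\delta_\Sigma)(1 + \delta_\Sigma)\left(1 + \frac{2}{25}\delta_\Sigma\right)}.$$

Finally, since $f(0) = 0$ and $\delta_\Sigma \le \frac{2}{5}$, one deduces

$$f(\delta_\Sigma) > \frac{\kappa - \frac{27}{50}}{\left(1 + \frac{2}{5}\kappa\right)\left(1 + \frac{2}{5}\right)\left(1 + \frac{4}{125}\right)}\delta_\Sigma = \frac{625\left(\kappa - \frac{27}{50}\right)}{903\left(1 + \frac{2}{5}\kappa\right)}\delta_\Sigma \ge \frac{5}{7}\gamma_\kappa\delta_\Sigma.$$

So $\Phi_{v,\Sigma} \le t^+$. $t^-$ is handled the same way.

$$\begin{aligned}
\frac{(1 + \kappa\delta_\Sigma)^{-p}\,\Phi_{\tilde{v}(x),(1+\delta_\Sigma)^{-1}\tilde{\Sigma}}(y)}{\Phi_{v(x),\Sigma}(y)} &\le (1 + \kappa\delta_\Sigma)^{-p}\sqrt{\frac{|\Sigma|}{|(1 + \delta_\Sigma)^{-1}\tilde{\Sigma}|}} \exp\left( \frac{1}{2}(v(x) - \tilde{v}(x))'\left( \Sigma - (1 + \delta_\Sigma)^{-1}\tilde{\Sigma} \right)^{-1}(v(x) - \tilde{v}(x)) \right) \\
&\le \frac{(1 + \delta_\Sigma)^{p/2}}{(1 + \kappa\delta_\Sigma)^p} \exp\left( \frac{1}{2}(v(x) - \tilde{v}(x))'\left( \Sigma - (1 + \delta_\Sigma)^{-1}\tilde{\Sigma} \right)^{-1}(v(x) - \tilde{v}(x)) \right).
\end{aligned}$$

Now

$$\left( \Sigma - (1 + \delta_\Sigma)^{-1}\tilde{\Sigma} \right)^{-1} = \left( \Sigma\left( (1 + \delta_\Sigma)\tilde{\Sigma}^{-1} - \Sigma^{-1} \right)(1 + \delta_\Sigma)^{-1}\tilde{\Sigma} \right)^{-1} = (1 + \delta_\Sigma)\tilde{\Sigma}^{-1}\left( (1 + \delta_\Sigma)\tilde{\Sigma}^{-1} - \Sigma^{-1} \right)^{-1}\Sigma^{-1}$$

and

$$\begin{aligned}
(v(x) - \tilde{v}(x))'\left( \Sigma - (1 + \delta_\Sigma)^{-1}\tilde{\Sigma} \right)^{-1}(v(x) - \tilde{v}(x)) &\le (1 + \delta_\Sigma)\tilde{L}^{-1}\lambda_-^{-1}\,2\tilde{L}\lambda_+\delta_\Sigma^{-1}\,L^{-1}\lambda_-^{-1}\,p\gamma_\kappa L_-\lambda_-^{2}\lambda_+^{-1}\delta_\Sigma^2 \\
&\le 2p\gamma_\kappa(1 + \delta_\Sigma)\delta_\Sigma.
\end{aligned}$$

We only need to prove that

$$\gamma_\kappa(1 + \delta_\Sigma)\delta_\Sigma \le \ln\left( \frac{1 + \kappa\delta_\Sigma}{\sqrt{1 + \delta_\Sigma}} \right).$$

Let

$$g(\delta_\Sigma) = \ln\left( \frac{1 + \kappa\delta_\Sigma}{\sqrt{1 + \delta_\Sigma}} \right), \qquad g'(\delta_\Sigma) = \frac{\kappa}{1 + \kappa\delta_\Sigma} - \frac{1}{2(1 + \delta_\Sigma)} = \frac{\kappa\delta_\Sigma + 2\kappa - 1}{2(1 + \delta_\Sigma)(1 + \kappa\delta_\Sigma)}.$$

Provided that $\kappa > \frac{1}{2}$ and $\delta_\Sigma \le \frac{2}{5}$,

$$g'(\delta_\Sigma) > \frac{2\kappa - 1}{2\left(1 + \frac{2}{5}\right)\left(1 + \frac{2}{5}\kappa\right)}.$$

Finally, since $g(0) = 0$,

$$g(\delta_\Sigma) > \frac{2\kappa - 1}{2\left(1 + \frac{2}{5}\right)\left(1 + \frac{2}{5}\kappa\right)}\delta_\Sigma = \frac{5(2\kappa - 1)}{14\left(1 + \frac{2}{5}\kappa\right)}\delta_\Sigma = \frac{7}{5}\gamma_\kappa\delta_\Sigma \ge (1 + \delta_\Sigma)\gamma_\kappa\delta_\Sigma.$$

One deduces $(1 + \kappa\delta_\Sigma)^{-p}\,\Phi_{\tilde{v}(x),(1+\delta_\Sigma)^{-1}\tilde{\Sigma}}(y) \le \Phi_{v(x),\Sigma}(y)$. □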
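The two scalar inequalities closing this proof, $f(\delta_\Sigma) \ge \frac{5}{7}\gamma_\kappa\delta_\Sigma$ and $g(\delta_\Sigma) \ge (1 + \delta_\Sigma)\gamma_\kappa\delta_\Sigma$ for $\kappa \ge \frac{17}{29}$ and $\delta_\Sigma \le \frac{2}{5}$, can be checked on a grid. This is a sketch under the reading of $f$, $g$ and $\gamma_\kappa$ used in this section; the helper names and the grid are hypothetical.

```python
import math

def gamma_kappa(kappa):
    """gamma_kappa = 25 (kappa - 1/2) / (49 (1 + 2 kappa / 5))."""
    return 25.0 * (kappa - 0.5) / (49.0 * (1.0 + 2.0 * kappa / 5.0))

def f(delta, kappa):
    """f = ln(1 + kappa d) - ln(1 + d)/2 - ln(1 + 2 d / 25)/2 (upper bracket)."""
    return (math.log(1.0 + kappa * delta)
            - 0.5 * math.log(1.0 + delta)
            - 0.5 * math.log(1.0 + 2.0 * delta / 25.0))

def g(delta, kappa):
    """g = ln((1 + kappa d) / sqrt(1 + d)) (lower bracket)."""
    return math.log(1.0 + kappa * delta) - 0.5 * math.log(1.0 + delta)

def worst_margins(kappa, n_points=400):
    """Smallest slack of both inequalities over a grid of (0, 2/5]."""
    gam = gamma_kappa(kappa)
    worst_f = worst_g = float("inf")
    for i in range(1, n_points + 1):
        delta = 0.4 * i / n_points
        worst_f = min(worst_f, f(delta, kappa) - 5.0 / 7.0 * gam * delta)
        worst_g = min(worst_g, g(delta, kappa) - (1.0 + delta) * gam * delta)
    return worst_f, worst_g
```

A negative margin for some admissible $\kappa$ would indicate that the bracket $[t^-, t^+]$ fails to contain $\Phi_{v(x),\Sigma}$; the threshold $\frac{17}{29}$ is the value at which the first inequality still holds.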



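C.5 Proof of inequalities used for bracketing entropy's decomposition

For the sake of completeness, we prove here the inequalities of Lemmas 11 and 12 of [9] used in the
proof of Lemma 4.

Proof of Lemma 11. For all $x$ in $\mathcal{X}$,

$$\begin{aligned}
d_{yk}^2\left(\pi^-(x)\psi^-(x,.),\ \pi^+(x)\psi^+(x,.)\right) &= \int \sum_{k=1}^K \left( \sqrt{\pi_k^+(x)}\left( \sqrt{\psi_k^+(x,y)} - \sqrt{\psi_k^-(x,y)} \right) + \sqrt{\psi_k^-(x,y)}\left( \sqrt{\pi_k^+(x)} - \sqrt{\pi_k^-(x)} \right) \right)^2 dy \\
&= \int \sum_{k=1}^K \pi_k^+(x)\left( \sqrt{\psi_k^+(x,y)} - \sqrt{\psi_k^-(x,y)} \right)^2 dy + \int \sum_{k=1}^K \psi_k^-(x,y)\left( \sqrt{\pi_k^+(x)} - \sqrt{\pi_k^-(x)} \right)^2 dy \\
&\quad + 2\int \sum_{k=1}^K \sqrt{\pi_k^+(x)}\left( \sqrt{\psi_k^+(x,y)} - \sqrt{\psi_k^-(x,y)} \right)\sqrt{\psi_k^-(x,y)}\left( \sqrt{\pi_k^+(x)} - \sqrt{\pi_k^-(x)} \right) dy \\
&\le \left( \sum_{k=1}^K \pi_k^+(x) \right)\max_k d_y^2\left(\psi_k^+(x,.), \psi_k^-(x,.)\right) + d_k^2\left(\pi^+(x), \pi^-(x)\right)\max_k \int \psi_k^-(x,y)\,dy \\
&\quad + 2\sum_{k=1}^K \sqrt{\pi_k^+(x)}\left( \sqrt{\pi_k^+(x)} - \sqrt{\pi_k^-(x)} \right) d_y\left(\psi_k^+(x,.), \psi_k^-(x,.)\right)\sqrt{\int \psi_k^-(x,y)\,dy} \\
&\le \left( \sum_{k=1}^K \pi_k^+(x) \right)\max_k d_y^2\left(\psi_k^+(x,.), \psi_k^-(x,.)\right) + d_k^2\left(\pi^+(x), \pi^-(x)\right)\max_k \int \psi_k^-(x,y)\,dy \\
&\quad + 2\max_k \sqrt{\int \psi_k^-(x,y)\,dy}\ \max_k d_y\left(\psi_k^+(x,.), \psi_k^-(x,.)\right)\sqrt{\sum_{k=1}^K \pi_k^+(x)}\ d_k\left(\pi^+(x), \pi^-(x)\right) \\
&= \left( \max_k d_y\left(\psi_k^+(x,.), \psi_k^-(x,.)\right)\sqrt{\sum_{k=1}^K \pi_k^+(x)} + d_k\left(\pi^+(x), \pi^-(x)\right)\max_k \sqrt{\int \psi_k^-(x,y)\,dy} \right)^2.
\end{aligned}$$

□

Proof of Lemma 12. For all $x$ in $\mathcal{X}$,

$$\begin{aligned}
d_y^2\left( \sum_{k=1}^K \pi_k^-(x)\psi_k^-(x,.),\ \sum_{k=1}^K \pi_k^+(x)\psi_k^+(x,.) \right) &= \int \left( \sqrt{\sum_{k=1}^K \pi_k^+(x)\psi_k^+(x,y)} - \sqrt{\sum_{k=1}^K \pi_k^-(x)\psi_k^-(x,y)} \right)^2 dy \\
&= \int \sum_{k=1}^K \pi_k^-(x)\psi_k^-(x,y)\,dy + \int \sum_{k=1}^K \pi_k^+(x)\psi_k^+(x,y)\,dy \\
&\quad - 2\int \sqrt{\sum_{k=1}^K \pi_k^+(x)\psi_k^+(x,y)}\sqrt{\sum_{k=1}^K \pi_k^-(x)\psi_k^-(x,y)}\,dy \\
&\le \int \sum_{k=1}^K \pi_k^-(x)\psi_k^-(x,y)\,dy + \int \sum_{k=1}^K \pi_k^+(x)\psi_k^+(x,y)\,dy \\
&\quad - 2\int \sum_{k=1}^K \sqrt{\pi_k^+(x)\psi_k^+(x,y)}\sqrt{\pi_k^-(x)\psi_k^-(x,y)}\,dy \\
&= d_{yk}^2\left(\pi^-(x)\psi^-(x,.),\ \pi^+(x)\psi^+(x,.)\right).
\end{aligned}$$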

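C.6 Proof of lemmas used for Gaussian's bracketing entropy
C.6.1 Proof of Lemma 7

Proof.

$$\delta_\Sigma \le \frac{1}{5\sqrt{\kappa^2\cosh\left(\frac{2\kappa}{5}\right) + \frac{1}{2}}}\frac{\delta}{p} \le \frac{1}{5\sqrt{\kappa^2 + \frac{1}{2}}}\frac{\delta}{p} \le \frac{1}{5\sqrt{\left(\frac{1}{2}\right)^2 + \frac{1}{2}}}\frac{\sqrt{2}}{p} = \frac{2\sqrt{2}}{5\sqrt{3}\,p} \le \frac{2}{5p}.$$

□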



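C.6.2 Proof of Lemma 8

Proof.

$$\begin{aligned}
2 - 2^{p/2+1}\left( (1 + \delta_\Sigma) + (1 + \delta_\Sigma)^{-1} \right)^{-p/2} &= 2\left( 1 - \left( \frac{e^{\ln(1+\delta_\Sigma)} + e^{-\ln(1+\delta_\Sigma)}}{2} \right)^{-p/2} \right) \\
&= 2\left( 1 - \left( \cosh(\ln(1 + \delta_\Sigma)) \right)^{-p/2} \right) \\
&= 2f(\ln(1 + \delta_\Sigma))
\end{aligned}$$

where $f(x) = 1 - \cosh(x)^{-p/2}$. Studying this function yields

$$f'(x) = \frac{p}{2}\sinh(x)\cosh(x)^{-p/2-1},$$

$$f''(x) = \frac{p}{2}\cosh(x)^{-p/2} - \frac{p}{2}\left(\frac{p}{2} + 1\right)\sinh(x)^2\cosh(x)^{-p/2-2};$$

as $\cosh(x) \ge 1$, we have thus

$$f''(x) \le \frac{p}{2}.$$

Now since $f(0) = 0$ and $f'(0) = 0$, this implies for any $x \ge 0$

$$f(x) \le \frac{p}{2}\frac{x^2}{2} \le \frac{p^2}{2}\frac{x^2}{2}.$$

We deduce thus that

$$2 - 2^{p/2+1}\left( (1 + \delta_\Sigma) + (1 + \delta_\Sigma)^{-1} \right)^{-p/2} \le \frac{1}{2}p^2\left(\ln(1 + \delta_\Sigma)\right)^2$$

and using $\ln(1 + \delta_\Sigma) \le \delta_\Sigma$

$$2 - 2^{p/2+1}\left( (1 + \delta_\Sigma) + (1 + \delta_\Sigma)^{-1} \right)^{-p/2} \le \frac{1}{2}p^2\delta_\Sigma^2.$$

Now,

$$(1 + \kappa\delta_\Sigma)^{p} + (1 + \kappa\delta_\Sigma)^{-p} - 2 = 2\left( \cosh(p\ln(1 + \kappa\delta_\Sigma)) - 1 \right) = 2g(p\ln(1 + \kappa\delta_\Sigma))$$

with $g(x) = \cosh(x) - 1$. Studying this function yields

$$g'(x) = \sinh(x) \quad \text{and} \quad g''(x) = \cosh(x)$$

and thus, since $g(0) = 0$ and $g'(0) = 0$, for any $0 \le x \le c$

$$g(x) \le \cosh(c)\frac{x^2}{2}.$$

Since $\ln(1 + \kappa\delta_\Sigma) \le \kappa\delta_\Sigma$, $p\,\delta_\Sigma \le c$ implies $p\ln(1 + \kappa\delta_\Sigma) \le \kappa c$, we obtain thus

$$(1 + \kappa\delta_\Sigma)^{p} + (1 + \kappa\delta_\Sigma)^{-p} - 2 \le \cosh(\kappa c)\,p^2\left(\ln(1 + \kappa\delta_\Sigma)\right)^2 \le \kappa^2\cosh(\kappa c)\,p^2\delta_\Sigma^2.$$

□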



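C.6.3 Proof of Lemma 9
Proof. By definition, writing $D_i$ and $\tilde{D}_i$ for the $i$-th columns of $D$ and $\tilde{D}$,

$$\begin{aligned}
x'\left( (1 + \delta_\Sigma)\tilde{\Sigma}^{-1} - \Sigma^{-1} \right)x &= (1 + \delta_\Sigma)\tilde{L}^{-1}\sum_{i=1}^p \tilde{A}_{i,i}^{-1}|\tilde{D}_i'x|^2 - L^{-1}\sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2 \\
&= (1 + \delta_\Sigma)\tilde{L}^{-1}\sum_{i=1}^p \tilde{A}_{i,i}^{-1}|\tilde{D}_i'x|^2 - (1 + \delta_\Sigma)\tilde{L}^{-1}\sum_{i=1}^p A_{i,i}^{-1}|\tilde{D}_i'x|^2 \\
&\quad + (1 + \delta_\Sigma)\tilde{L}^{-1}\sum_{i=1}^p A_{i,i}^{-1}|\tilde{D}_i'x|^2 - (1 + \delta_\Sigma)\tilde{L}^{-1}\sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2 \\
&\quad + (1 + \delta_\Sigma)\tilde{L}^{-1}\sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2 - L^{-1}\sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2.
\end{aligned}$$

Along the same lines,

$$\begin{aligned}
x'\left( \Sigma^{-1} - (1 + \delta_\Sigma)^{-1}\tilde{\Sigma}^{-1} \right)x &= L^{-1}\sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2 - (1 + \delta_\Sigma)^{-1}\tilde{L}^{-1}\sum_{i=1}^p \tilde{A}_{i,i}^{-1}|\tilde{D}_i'x|^2 \\
&= L^{-1}\sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2 - (1 + \delta_\Sigma)^{-1}\tilde{L}^{-1}\sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2 \\
&\quad + (1 + \delta_\Sigma)^{-1}\tilde{L}^{-1}\sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2 - (1 + \delta_\Sigma)^{-1}\tilde{L}^{-1}\sum_{i=1}^p A_{i,i}^{-1}|\tilde{D}_i'x|^2 \\
&\quad + (1 + \delta_\Sigma)^{-1}\tilde{L}^{-1}\sum_{i=1}^p A_{i,i}^{-1}|\tilde{D}_i'x|^2 - (1 + \delta_\Sigma)^{-1}\tilde{L}^{-1}\sum_{i=1}^p \tilde{A}_{i,i}^{-1}|\tilde{D}_i'x|^2.
\end{aligned}$$

Now

$$\begin{aligned}
\left| \sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2 - \sum_{i=1}^p A_{i,i}^{-1}|\tilde{D}_i'x|^2 \right| &\le \sum_{i=1}^p A_{i,i}^{-1}\left| |D_i'x|^2 - |\tilde{D}_i'x|^2 \right| \\
&\le \lambda_-^{-1}\sum_{i=1}^p |D_i'x - \tilde{D}_i'x|\,|D_i'x + \tilde{D}_i'x| \\
&\le \lambda_-^{-1}\left( \sum_{i=1}^p |D_i'x - \tilde{D}_i'x|^2 \right)^{1/2}\left( \sum_{i=1}^p |D_i'x + \tilde{D}_i'x|^2 \right)^{1/2} \\
&\le 2\lambda_-^{-1}\delta_D\|x\|^2.
\end{aligned}$$

Furthermore,

$$\left| \sum_{i=1}^p A_{i,i}^{-1}|\tilde{D}_i'x|^2 - \sum_{i=1}^p \tilde{A}_{i,i}^{-1}|\tilde{D}_i'x|^2 \right| \le \sum_{i=1}^p \left| A_{i,i}^{-1} - \tilde{A}_{i,i}^{-1} \right||\tilde{D}_i'x|^2 \le \delta_A\lambda_-^{-1}\sum_{i=1}^p |\tilde{D}_i'x|^2 = \delta_A\lambda_-^{-1}\|x\|^2.$$

We notice then that

$$(1 + \delta_\Sigma)\tilde{L}^{-1}\sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2 - L^{-1}\sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2 = \left( (1 + \delta_\Sigma)\tilde{L}^{-1} - L^{-1} \right)\sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2 \ge (\delta_\Sigma - \delta_L)\tilde{L}^{-1}\lambda_+^{-1}\|x\|^2$$

while

$$\begin{aligned}
L^{-1}\sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2 - (1 + \delta_\Sigma)^{-1}\tilde{L}^{-1}\sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2 &= \left( L^{-1} - (1 + \delta_\Sigma)^{-1}\tilde{L}^{-1} \right)\sum_{i=1}^p A_{i,i}^{-1}|D_i'x|^2 \\
&\ge \left( 1 - (1 + \delta_\Sigma)^{-1} \right)\tilde{L}^{-1}\lambda_+^{-1}\|x\|^2 \\
&= \frac{\delta_\Sigma}{1 + \delta_\Sigma}\tilde{L}^{-1}\lambda_+^{-1}\|x\|^2.
\end{aligned}$$

We deduce thus that

$$\begin{aligned}
x'\left( (1 + \delta_\Sigma)\tilde{\Sigma}^{-1} - \Sigma^{-1} \right)x &\ge (\delta_\Sigma - \delta_L)\tilde{L}^{-1}\lambda_+^{-1}\|x\|^2 - (1 + \delta_\Sigma)\tilde{L}^{-1}\lambda_-^{-1}(2\delta_D + \delta_A)\|x\|^2 \\
&= \tilde{L}^{-1}\left( (\delta_\Sigma - \delta_L)\lambda_+^{-1} - (1 + \delta_\Sigma)\lambda_-^{-1}(2\delta_D + \delta_A) \right)\|x\|^2
\end{aligned}$$

and

$$\begin{aligned}
x'\left( \Sigma^{-1} - (1 + \delta_\Sigma)^{-1}\tilde{\Sigma}^{-1} \right)x &\ge \frac{\delta_\Sigma}{1 + \delta_\Sigma}\tilde{L}^{-1}\lambda_+^{-1}\|x\|^2 - (1 + \delta_\Sigma)^{-1}\tilde{L}^{-1}\lambda_-^{-1}(2\delta_D + \delta_A)\|x\|^2 \\
&= \frac{\tilde{L}^{-1}}{1 + \delta_\Sigma}\left( \delta_\Sigma\lambda_+^{-1} - \lambda_-^{-1}(2\delta_D + \delta_A) \right)\|x\|^2.
\end{aligned}$$

□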